R is a dialect of the S language. It is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. There is a wide variety of data types, including vectors (numerical, character, logical), matrices, data frames, and lists. Most functionality is provided through built-in and user-created functions and all data objects are kept in memory during an interactive session. Basic functions are available by default. Other functions are contained in packages that can be attached to a current session as needed.
R is a case-sensitive language: FOO, Foo, and foo are three different objects!
This section describes working with the R interface. A key skill for using R effectively is learning how to use the built-in help system. Other sections describe the working environment, inputting programs and outputting results, installing new functionality through packages, GUIs that have been developed for R, customizing the environment, producing high-quality output, and running programs in batch. A fundamental design feature of R is that the output from most functions can be used as input to other functions. This is described in reusing results.

## The Workspace
The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions). At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started. Commands are entered interactively at the R user prompt. Up and down arrow keys scroll through your command history.
You will probably want to keep different projects in different physical directories. Here are some standard commands for managing your workspace.
getwd() # print the current working directory - cwd
## [1] "/Users/bherault/Library/Mobile Documents/com~apple~CloudDocs/Bruno/Teaching/INPHB/ED"
ls() # list the objects in the current workspace
## character(0)
# setwd(mydirectory)      # change to mydirectory
# setwd("c:/docs/mydir")  # note / instead of \ on Windows
# setwd("/usr/rob/mydir") # on Linux
view and set options for the session
help(options) # learn about available options
options(digits=3) # number of digits to print on output
work with your previous commands
# history() # display last 25 commands
# history(max.show=Inf) # display all previous commands
save your command history
# savehistory(file="myfile") # default is ".Rhistory"
recall your command history
#loadhistory(file="myfile") # default is ".Rhistory"
Load a workspace into the current session. If you don't specify a path, the cwd is assumed.
#load("myfile.RData")
#q() # quit R. You will be prompted to save the workspace.
Important note for Windows users: R gets confused if you use a path in your code like:

# c:\mydocuments\myfile.txt

This is because R sees "\" as an escape character. Instead, use:

# c:\\mydocuments\\myfile.txt
# c:/mydocuments/myfile.txt

Either will work.
By default, launching R starts an interactive session with input from the keyboard and output to the screen. However, you can have input come from a script file (a file containing R commands) and direct output to a variety of destinations.
The source() function runs a script in the current session. If the filename does not include a path, the file is taken from the current working directory.
Input a script:
# source("myfile")
The sink() function redirects output to a file; calling sink() with no arguments restores output to the console.
#sink()
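As a small sketch (the filename here is just an example), sink() captures printed results in a file until a second call restores the console:

```r
# Send printed output to a file instead of the screen (example filename)
sink("analysis-output.txt")
print(summary(1:10))
sink()                       # restore output to the console

# The captured lines are now in the file
out <- readLines("analysis-output.txt")
```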
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
## [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library"
library() # see all packages installed
search() # see packages currently loaded
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
You can expand the types of analyses you do by adding other packages. A complete list of contributed packages is available from CRAN.
Follow these steps:

1. Download and install the package (you only need to do this once).
2. To use the package, call library(package) to load it into the current session. (You need to do this once in each session, unless you customize your environment to load it automatically each time.)
On MS Windows: choose Install Packages from the Packages menu, select a CRAN mirror (e.g. Norway), then select a package (e.g. boot). Finally, use the library(package) function to load it for use.
On Linux: download the package of interest as a compressed file and, at the command prompt, install it using R CMD INSTALL [options] [-l lib] pkgs. Then use the library(package) function within R to load it for use in the session.
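On any platform you can also install from within R itself. A minimal sketch using the boot package (any CRAN package name works the same way; the install line is commented out because it needs a network connection):

```r
# One-time download and installation from CRAN:
# install.packages("boot")

# Per-session loading; boot ships with standard R installations
library(boot)
"package:boot" %in% search()   # TRUE once the package is attached
```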
Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, data frames) that you can operate on through functions that perform statistical analyses and create graphs. In this way, R is similar to PROC IML.
This section describes how to enter or import data into R, and how to prepare it for use in statistical analyses. Topics include R data structures, importing data (from Excel, SPSS, SAS, Stata, and ASCII Text Files), entering data from the keyboard, creating an interface with a database management system, exporting data (to Excel, SPSS, SAS, Stata, and Tab Delimited Text Files), annotating data (with variable labels and value labels), and listing data. In addition, methods for handling missing values and date values are presented.
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.

### Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
a
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
b <- c("one","two","three") # character vector
b
## [1] "one" "two" "three"
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector (note: the name c masks the built-in c() function; avoid it in real code)
c
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
## [1] 2 6
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the matrix should be filled by columns (the default). dimnames provides optional labels for the columns and rows.
generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
y
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
z<-matrix(1:20, nrow=5,ncol=4, byrow=T)
z
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [5,] 17 18 19 20
another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
mymatrix
## C1 C2
## R1 1 26
## R2 24 68
Identify rows, columns, or elements using subscripts.
z[,4] # 4th column of matrix Z
## [1] 4 8 12 16 20
z[3,] # 3rd row of matrix Z
## [1] 9 10 11 12
z[2:4,1:3] # rows 2,3,4 of columns 1,2,3
## [,1] [,2] [,3]
## [1,] 5 6 7
## [2,] 9 10 11
## [3,] 13 14 15
Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.
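A minimal array sketch: array() takes a data vector and a dim vector (filled column-major), and elements are addressed with one subscript per dimension:

```r
# A 2 x 3 x 2 numeric array filled column-major from 1:12
myarray <- array(1:12, dim = c(2, 3, 2))
myarray[1, 2, 2]   # row 1, column 2 of the second "layer": 9
```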
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <-c("ID","Color","Passed") # variable names
mydata
str(mydata)
## 'data.frame': 4 obs. of 3 variables:
## $ ID : num 1 2 3 4
## $ Color : Factor w/ 2 levels "red","white": 1 2 1 NA
## $ Passed: logi TRUE TRUE TRUE FALSE
There are a variety of ways to identify the elements of a data frame.
mydata[2:3] # columns 2,3 of data frame
mydata[c("ID","Passed")] # columns ID and Passed from data frame
mydata$Passed # variable Passed in the data frame
## [1] TRUE TRUE TRUE FALSE
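Rows can be selected the same way, with a logical condition or with subset(). A small sketch, rebuilding the frame above with named columns:

```r
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(ID = d, Color = e, Passed = f)

mydata[mydata$Passed, ]                        # rows where Passed is TRUE
subset(mydata, Passed, select = c(ID, Color))  # same rows, chosen columns
```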
A list is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.
Example of a list with 4 components: a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
w
## $name
## [1] "Fred"
##
## $mynumbers
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
##
## $mymatrix
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
##
## $age
## [1] 5.3
Note that c() combines the components of both lists into a single list (here, 8 components) rather than creating a list of two lists.
v <- c(w,w)
v
## $name
## [1] "Fred"
##
## $mynumbers
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
##
## $mymatrix
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
##
## $age
## [1] 5.3
##
## $name
## [1] "Fred"
##
## $mynumbers
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
##
## $mymatrix
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
##
## $age
## [1] 5.3
Identify elements of a list using the [[]] convention.
w[[3]] # 3rd component of the list
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector of integers in the range [ 1… k ] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
variable gender with 20 “male” entries and 30 “female” entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
gender
## [1] male male male male male male male male male male
## [11] male male male male male male male male male male
## [21] female female female female female female female female female female
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## Levels: female male
R stores gender as 20 2s and 30 1s and associates 1=female, 2=male internally (levels are ordered alphabetically). R now treats gender as a nominal variable.
summary(gender)
## female male
## 30 20
An ordered factor is used to represent an ordinal variable. For a variable rating coded as "large", "medium", "small", rating <- ordered(rating) recodes rating to 1, 2, 3 and associates 1=large, 2=medium, 3=small internally (alphabetical order). R now treats rating as ordinal.
R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses. You can use options in the factor() and ordered() functions to control the mapping of integers to strings (overriding the alphabetical ordering). You can also use factors to create value labels.
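A sketch of overriding the alphabetical ordering, using a hypothetical rating variable:

```r
rating <- c("medium", "large", "small", "medium")

# levels= fixes the integer mapping; ordered=TRUE makes the factor ordinal
rating <- factor(rating, levels = c("small", "medium", "large"),
                 ordered = TRUE)
as.integer(rating)   # small=1, medium=2, large=3
rating < "large"     # ordinal comparisons now make sense
```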
length(w) # number of elements or components
## [1] 4
str(w) # structure of an object
## List of 4
## $ name : chr "Fred"
## $ mynumbers: num [1:6] 1 2 5.3 6 -2 4
## $ mymatrix : int [1:5, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 5.3
class(w) # class or type of an object
## [1] "list"
names(w) #names
## [1] "name" "mynumbers" "mymatrix" "age"
c(a,a) # combine objects into a vector
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0 1.0 2.0 5.3 6.0 -2.0 4.0
cbind(a, a) # combine objects as columns
## a a
## [1,] 1.0 1.0
## [2,] 2.0 2.0
## [3,] 5.3 5.3
## [4,] 6.0 6.0
## [5,] -2.0 -2.0
## [6,] 4.0 4.0
rbind(a, a) # combine objects as rows
## [,1] [,2] [,3] [,4] [,5] [,6]
## a 1 2 5.3 6 -2 4
## a 1 2 5.3 6 -2 4
a # prints the object a
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
ls() # list current objects
## [1] "a" "b" "c" "cells" "cnames" "d"
## [7] "e" "f" "gender" "mydata" "mymatrix" "rnames"
## [13] "v" "w" "y" "z"
rm(v) # delete an object
ls() # without v
## [1] "a" "b" "c" "cells" "cnames" "d"
## [7] "e" "f" "gender" "mydata" "mymatrix" "rnames"
## [13] "w" "y" "z"
Importing data into R is fairly simple. For Stata and Systat, use the foreign package. For SPSS and SAS I would recommend the Hmisc package for ease and functionality. Examples of importing data are provided below.
# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on mswindows systems
#mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")
(To practice importing a csv file, try importing the Paracou dataset using read.csv2().)
paracou <- read.csv2("paracou2015.csv")
head(paracou)
str(paracou)
## 'data.frame': 19 obs. of 20 variables:
## $ NomForet : Factor w/ 1 level "Paracou": 1 1 1 1 1 1 1 1 1 1 ...
## $ Libelle : int 1 1 1 1 1 1 1 1 1 1 ...
## $ NumCarre : int 1 1 1 1 1 1 1 1 1 1 ...
## $ NumArbre : int 2974 2973 2972 2971 2970 2969 2968 2967 2966 2965 ...
## $ X : num 27.5 15 62 34.5 65 ...
## $ Y : num 190 188 182 184 201 ...
## $ Xutm : num 285114 285102 285148 285122 285146 ...
## $ Yutm : num 583051 583046 583052 583047 583071 ...
## $ Annee : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ Circonference: num 36 34 32 38 32.5 33.5 32.5 34.5 33 32 ...
## $ MesuMort : Factor w/ 1 level "VRAI": 1 1 1 1 1 1 1 1 1 1 ...
## $ idCodeMesure : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Echelle : Factor w/ 1 level "FAUX": 1 1 1 1 1 1 1 1 1 1 ...
## $ idVern : int 751 606 712 623 629 802 105 403 616 404 ...
## $ NomVern : Factor w/ 16 levels "baaka koko","bakuman",..: 12 13 9 5 2 16 4 10 6 11 ...
## $ idTaxon : int 33679 33679 33679 33679 33679 33679 33679 33679 33679 33679 ...
## $ Famille : Factor w/ 1 level "Indet.": 1 1 1 1 1 1 1 1 1 1 ...
## $ Genre : Factor w/ 1 level "Indet.Indet.": 1 1 1 1 1 1 1 1 1 1 ...
## $ Espece : Factor w/ 1 level "Indet.": 1 1 1 1 1 1 1 1 1 1 ...
## $ idArbre : int 249619 249618 249617 249616 249615 249614 249613 249612 249611 249610 ...
One of the best ways to read an Excel file is to export it to a comma delimited file and import it using the method above. Alternatively you can use the xlsx package to access Excel files. The first row should contain variable/column names.
Read in the first worksheet from the workbook myexcel.xlsx; the first row contains variable names
#library(xlsx)
#mydata <- read.xlsx("c:/myexcel.xlsx", 1)
Read in the worksheet named mysheet
#mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")
Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an ASCII file. To create it interactively, you can do something like the following.
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)
mydata
You can also use R’s built in spreadsheet to enter the data interactively, as in the following example.
#mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
#mydata <- edit(mydata)
Note that without the assignment in the line above, the edits are not saved!
Exporting Data

There are numerous methods for exporting R objects into other formats. For SPSS, SAS, and Stata, you will need to load the foreign package. For Excel, you will need the xlsReadWrite package.
write.table(mydata, "mydata", sep="\t")
# library(xlsx)
# write.xlsx(mydata, "mydata")
Be careful with rJava: its architecture (32- or 64-bit) must match your R installation.
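If Excel is only a transit format, plain CSV avoids the Java dependency altogether. A sketch with an example filename:

```r
mydata <- data.frame(age = c(25, 30, 56),
                     gender = c("male", "female", "male"),
                     weight = c(160, 110, 220))

# Excel opens CSV files directly; row.names=FALSE drops the index column
write.csv(mydata, "mydata.csv", row.names = FALSE)
back <- read.csv("mydata.csv")
```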
There are a number of functions for listing the contents of an object or dataset.
List objects in the working environment
ls()
## [1] "a" "age" "b" "c" "cells" "cnames"
## [7] "d" "e" "f" "gender" "mydata" "mymatrix"
## [13] "paracou" "rnames" "w" "weight" "y" "z"
List the variables in mydata
names(mydata)
## [1] "age" "gender" "weight"
List the structure of mydata
str(mydata)
## 'data.frame': 3 obs. of 3 variables:
## $ age : num 25 30 56
## $ gender: Factor w/ 2 levels "female","male": 2 1 2
## $ weight: num 160 110 220
List levels of factor gender in mydata
levels(mydata$gender)
## [1] "female" "male"
Dimensions of an object
dim(mydata)
## [1] 3 3
Class of an object (numeric, matrix, data frame, etc)
class(mydata)
## [1] "data.frame"
Print mydata
mydata
Print first 10 rows of paracou
head(paracou, n=10)
Print last 5 rows of paracou
tail(paracou, n=5)
In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.
is.na(2) # returns TRUE if the value is missing
## [1] FALSE
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
## [1] FALSE FALSE FALSE TRUE
Recode 99 to missing for variable v1: select the rows where v1 is 99 and set column v1 to NA.
#mydata$v1[mydata$v1==99] <- NA
Arithmetic functions on missing values yield missing values.
x <- c(1,2,NA,3)
mean(x) # returns NA
## [1] NA
mean(x, na.rm=TRUE) # returns 2
## [1] 2
The function complete.cases() returns a logical vector indicating which cases are complete.
List rows of data that have missing values:
mydata[!complete.cases(mydata),]
The function na.omit() returns the object with listwise deletion of missing values.
Create a new dataset without missing data:
x
## [1] 1 2 NA 3
newdata <- na.omit(x)
newdata
## [1] 1 2 3
## attr(,"na.action")
## [1] 3
## attr(,"class")
## [1] "omit"
Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.
Use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
mydates
## [1] "2007-06-22" "2004-02-13"
class(mydates)
## [1] "Date"
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
days
## Time difference of 1225 days
Sys.Date( ) returns today’s date. date() returns the current date and time.
Sys.Date( )
## [1] "2018-01-15"
date()
## [1] "Mon Jan 15 10:40:20 2018"
Convert date info in format ‘mm/dd/yyyy’
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
dates
## [1] "1965-01-05" "1975-08-16"
The default format is yyyy-mm-dd
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
Date to Character

You can convert dates to character data using the as.character() function.
Convert dates to character data
strDates <- as.character(dates)
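To convert to a chosen layout rather than the default, format() accepts the same %-codes used by as.Date(). A small sketch:

```r
d <- as.Date("2004-02-13")
format(d, "%m/%d/%Y")   # "02/13/2004"
as.character(d)          # default layout: "2004-02-13"
```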
Once you have access to your data, you will want to massage it into useful form. This includes creating new variables (including recoding and renaming existing variables), sorting and merging datasets, aggregating data, reshaping data, and subsetting datasets (including selecting observations that meet criteria, randomly sampling observations, and dropping or keeping variables).
Each of these activities usually involves the use of R's built-in operators (arithmetic and logical) and functions (numeric, character, and statistical). Additionally, you may need to use control structures (if-then, for, while, switch) in your programs and/or create your own functions. Finally, you may need to convert variables or datasets from one type to another (e.g. numeric to character, or matrix to data frame).
This section describes each task from an R perspective.
Use the assignment operator <- to create new variables. A wide array of operators and functions is available.
Three examples for doing the same computations
paracou$XYsum1 <- paracou$X + paracou$Y
paracou$XYmean1 <- (paracou$X + paracou$Y)/2
head(paracou)
attach(paracou)
paracou$XYsum2 <- X + Y
paracou$XYmean2 <- (X + Y)/2
detach(paracou)
head(paracou)
paracou <- transform( paracou,
sum3 = X + Y,
mean3 = (X + Y)/2
)
head(paracou)
In order to recode data, you will probably use one or more of R’s control structures.
1st example: create 2 size categories (big and small) according to whether Circonference > 35
paracou$size <- ifelse(paracou$Circonference > 35, "big", "small")
head(paracou)
Another example: create 3 size categories with thresholds 34 and 37
attach(paracou)
paracou$size[Circonference < 34] <- "small"
paracou$size[Circonference >= 34 & Circonference <= 37] <- "medium"
paracou$size[Circonference > 37] <- "big"
detach(paracou)
head(paracou)
You can rename variables programmatically or interactively.
Rename interactively (Windows only; fix() typically does not work on Mac or Ubuntu without an X11 server such as XQuartz):
#fix(paracou) # results are saved on close
Rename programmatically:
library(reshape)
paracou <- rename(paracou, c(size="newname"))
head(paracou)
Alternatively, re-enter all the variable names in order, changing the ones you need to change. The limitation is that you need to enter all of them!
mymatrix
## C1 C2
## R1 1 26
## R2 24 68
colnames(mymatrix) <- c("x1","x2")
mymatrix
## x1 x2
## R1 1 26
## R2 24 68
R’s binary and logical operators will look very familiar to programmers. Note that binary operators work on vectors and matrices as well as scalars.
| Operator | Description |
|---|---|
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| ^ or ** | exponentiation |
| x %% y | modulus (x mod y) 5%%2 is 1 |
| x %/% y | integer division 5%/%2 is 2 |
| Operator | Description |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | Not x |
| x \| y | x OR y |
| x & y | x AND y |
| isTRUE(x) | test if x is TRUE |
x <- c(1:10)
x[(x>8) | (x<5)]
## [1] 1 2 3 4 9 10
x <- c(1:10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
x > 8
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
x < 5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x > 8 | x < 5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
x[c(T,T,T,T,F,F,F,F,T,T)]
## [1] 1 2 3 4 9 10
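which() converts a logical mask into integer positions, which is sometimes handier than the mask itself. A sketch on the same vector:

```r
x <- 1:10
keep <- which(x > 8 | x < 5)   # integer indices instead of TRUE/FALSE
keep                            # 1 2 3 4 9 10
x[keep]                         # same result as x[x > 8 | x < 5]
```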
Almost everything in R is done through functions. Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.
| Function | Description |
|---|---|
| abs(x) | absolute value |
| sqrt(x) | square root |
| ceiling(x) | ceiling(3.475) is 4 |
| floor(x) | floor(3.475) is 3 |
| trunc(x) | trunc(5.99) is 5 |
| round(x, digits=n) | round(3.475, digits=2) is 3.48 |
| cos(x), sin(x), tan(x) | trigonometric functions; also acos(x), cosh(x), acosh(x), etc. |
| log(x) | natural logarithm |
| log10(x) | common logarithm |
| exp(x) | e^x |
| Function | Description |
|---|---|
| substr(x, start=n1, stop=n2) | Extract or replace substrings in a vector |
| grep(pattern, x ) | Search for pattern in x. Returns matching indices. |
| sub(pattern, replacement, x) | Find pattern in x and replace. |
| strsplit(x, split) | Split the elements of vector x at split. |
| paste(..., sep="") | Concatenate strings using sep to separate them. |
| toupper(x) | Uppercase |
| tolower(x) | Lowercase |
x <- "abcdef"
substr(x, 2, 4)
## [1] "bcd"
grep("A", c("b","A","c"), fixed=TRUE)
## [1] 2
sub("\\s",".","Hello There")
## [1] "Hello.There"
strsplit("abc", "")
## [[1]]
## [1] "a" "b" "c"
paste("x",1:3,sep="")
## [1] "x1" "x2" "x3"
paste("x",1:3,sep="M")
## [1] "xM1" "xM2" "xM3"
paste("Today is", date())
## [1] "Today is Mon Jan 15 10:40:20 2018"
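Note that sub() replaces only the first match; gsub() replaces all of them:

```r
sub("\\s", ".", "Hello There World")    # "Hello.There World" (first space only)
gsub("\\s", ".", "Hello There World")   # "Hello.There.World" (every space)
```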
The following table describes functions related to probability distributions. For the random number generators below, you can use set.seed(1234) (or some other integer) to create reproducible pseudo-random numbers.
| Function | Description |
|---|---|
| dnorm(x) | normal density function (by default m=0 sd=1) |
| pnorm(q) | cumulative normal probability for q |
| qnorm(p) | normal quantile |
| rnorm(n, m=0,sd=1) | n random normal numbers with mu m, sigma sd |
x <- pretty(c(-3,3), 30)
x
## [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4
## [15] -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4
## [29] 2.6 2.8 3.0
y <- dnorm(x)
y
## [1] 0.00443 0.00792 0.01358 0.02239 0.03547 0.05399 0.07895 0.11092
## [9] 0.14973 0.19419 0.24197 0.28969 0.33322 0.36827 0.39104 0.39894
## [17] 0.39104 0.36827 0.33322 0.28969 0.24197 0.19419 0.14973 0.11092
## [25] 0.07895 0.05399 0.03547 0.02239 0.01358 0.00792 0.00443
plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(1.96)
## [1] 0.975
qnorm(.9)
## [1] 1.28
Binomial distribution: dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob), rbinom(n, size, prob)
dbinom(-5:20, 10, 0.5)
## [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000977 0.009766
## [8] 0.043945 0.117188 0.205078 0.246094 0.205078 0.117188 0.043945
## [15] 0.009766 0.000977 0.000000 0.000000 0.000000 0.000000 0.000000
## [22] 0.000000 0.000000 0.000000 0.000000 0.000000
pbinom(-5:20, 10, .5)
## [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000977 0.010742
## [8] 0.054687 0.171875 0.376953 0.623047 0.828125 0.945312 0.989258
## [15] 0.999023 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
## [22] 1.000000 1.000000 1.000000 1.000000 1.000000
Poisson distribution, with mean = variance = lambda:
dpois(x, lambda), ppois(q, lambda), qpois(p, lambda), rpois(n, lambda)
# probability of 0, 1, or 2 events with lambda=4
dpois(0:2, 4)
## [1] 0.0183 0.0733 0.1465
# probability of at least 3 events with lambda=4
1- ppois(2,4)
## [1] 0.762
The uniform distribution follows the same pattern as the normal distribution above.
dunif(x, min=0, max=1), punif(q, min=0, max=1), qunif(p, min=0, max=1), runif(n, min=0, max=1)
#10 uniform random variates
x <- runif(10)
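The set.seed() reproducibility mentioned above can be checked directly:

```r
set.seed(1234)
a <- runif(3)
set.seed(1234)
b <- runif(3)
identical(a, b)   # TRUE: the same seed reproduces the same draws
```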
Other useful statistical functions are provided in the following table. Each has the option na.rm to strip missing values before calculations. Otherwise the presence of missing values will lead to a missing result. Object can be a numeric vector or data frame.
| Function | Description |
|---|---|
| mean(x, trim=0,na.rm=FALSE) | mean of object x |
| sd(x) | standard deviation of object(x) |
| var(x) | variance |
| median(x) | median |
| quantile(x, probs) | quantiles of x, where probs is a vector of probabilities |
| range(x) | range |
| sum(x) | sum |
| diff(x, lag=1) | lagged differences |
| min(x) | minimum |
| max(x) | maximum |
| scale(x, center=TRUE, scale=TRUE) | column center or standardize a matrix |
# trimmed mean: remove missing values and the highest and lowest 5 percent of scores
mx <- mean(age,trim=.05,na.rm=TRUE)
mx
## [1] 37
# 30th and 84th percentiles of x
y <- quantile(x, c(.3,.84))
y
## 30% 84%
## 0.594 0.892
| Function | Description |
|---|---|
| seq(from , to, by) | generate a sequence |
| rep(x, ntimes) | repeat x n times |
| cut(x, n) | divide a continuous variable into a factor with n levels |
indices <- seq(1,10,2)
indices
## [1] 1 3 5 7 9
y <- rep(1:3, 2)
y
## [1] 1 2 3 1 2 3
x
## [1] 0.73282 0.87162 0.89639 0.70588 0.88684 0.12105 0.98174 0.00742
## [9] 0.33138 0.73420
z <- cut(x, 5)
z
## [1] (0.592,0.787] (0.787,0.983] (0.787,0.983] (0.592,0.787]
## [5] (0.787,0.983] (0.00645,0.202] (0.787,0.983] (0.00645,0.202]
## [9] (0.202,0.397] (0.592,0.787]
## 5 Levels: (0.00645,0.202] (0.202,0.397] (0.397,0.592] ... (0.787,0.983]
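cut() also accepts explicit breaks and labels, which is usually what you want for recoding. A sketch with hypothetical age classes:

```r
ages <- c(3, 17, 25, 64, 80)
# Intervals are right-closed by default: (0,18], (18,65], (65,Inf]
grp <- cut(ages, breaks = c(0, 18, 65, Inf),
           labels = c("minor", "adult", "senior"))
grp
table(grp)
```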
R has the standard control structures you would expect. expr can be multiple (compound) statements by enclosing them in braces { }. It is more efficient to use built-in functions rather than control structures whenever possible.
| Structure | Syntax |
|---|---|
| if-else | if (cond) expr; if (cond) expr1 else expr2 |
| for | for (var in seq) expr |
| while | while (cond) expr |
| switch | switch(expr, ...) |
| ifelse | ifelse(test, yes, no) |
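switch() is the one structure in the table without an example later in this section. A small sketch dispatching on a string (the function and argument names here are made up):

```r
centre <- function(x, type = "mean") {
  switch(type,
         mean   = mean(x),
         median = median(x),
         stop("unknown type"))   # unnamed last arm is the default
}
centre(c(1, 2, 10), "mean")
centre(c(1, 2, 10), "median")   # 2
```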
Example: transpose a matrix, i.e. a poor alternative to the built-in t() function
mytrans <- function(x) {
if (!is.matrix(x)) {
warning("argument is not a matrix: returning NA")
return(NA_real_)
}
y <- matrix(1, nrow=ncol(x), ncol=nrow(x))
for (i in 1:nrow(x)) {
for (j in 1:ncol(x)) {
y[j,i] <- x[i,j]
}
}
return(y)
}
mymatrix
## x1 x2
## R1 1 26
## R2 24 68
mytrans(mymatrix)
## [,1] [,2]
## [1,] 1 24
## [2,] 26 68
One of the great strengths of R is the user’s ability to add functions. In fact, many of the functions in R are actually functions of functions. The structure of a function is given below.
myfunction <- function(arg1, arg2, ...) {
  statements
  return(object)
}
Objects in the function are local to the function. The object returned can be any data type. Here is an example.
Function example: get measures of central tendency and spread for a numeric vector x. The user has a choice of measures, and of whether the results are printed.
mysummary <- function(x,npar=TRUE,print=TRUE) {
if (!npar) {
center <- mean(x); spread <- sd(x)
} else {
center <- median(x); spread <- mad(x)
}
if (print & !npar) {
cat("Mean=", center, "\n", "SD=", spread, "\n")
} else if (print & npar) {
cat("Median=", center, "\n", "MAD=", spread, "\n")
}
result <- list(center=center,spread=spread)
return(result)
}
mysummary
## function(x,npar=TRUE,print=TRUE) {
## if (!npar) {
## center <- mean(x); spread <- sd(x)
## } else {
## center <- median(x); spread <- mad(x)
## }
## if (print & !npar) {
## cat("Mean=", center, "\n", "SD=", spread, "\n")
## } else if (print & npar) {
## cat("Median=", center, "\n", "MAD=", spread, "\n")
## }
## result <- list(center=center,spread=spread)
## return(result)
## }
# invoking the function
set.seed(1234)
x <- rpois(500, 4)
y <- mysummary(x)
## Median= 4
## MAD= 1.48
# y$center is the median (4)
# y$spread is the median absolute deviation (1.48)
It can be instructive to look at the code of a function. In R, you can view a function's code by typing the function name without the ( ). If this method fails, the R Wiki has hints on viewing function source code.
Finally, you may want to store your own functions, and have them available in every session. You can customize the R environment to load your functions at start-up.
Use sort() to sort a vector or factor into ascending or descending order.
a
## [1] 1.0 2.0 5.3 6.0 -2.0 4.0
sort(a, decreasing = F)
## [1] -2.0 1.0 2.0 4.0 5.3 6.0
sort(a, decreasing = T)
## [1] 6.0 5.3 4.0 2.0 1.0 -2.0
To sort a data frame in R, use the order() function. By default, sorting is ascending. Prefix a sorting variable with a minus sign to indicate descending order. Below are some examples.
# sorting examples using the mtcars dataset
attach(mtcars)
# sort by cyl
newdata <- mtcars[order(cyl),]
newdata
# sort by cyl and mpg
newdata <- mtcars[order(cyl, mpg),]
newdata
#sort by cyl (ascending) and mpg (descending)
newdata <- mtcars[order(cyl, -mpg),]
newdata
detach(mtcars)
To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
Merge two data frames by a common key (here, NumArbre)
data1<-paracou[1:10, 1:4]
data1
data2<-paracou[1:10, 4:8]
data2
total <- merge(data1,data2,by="NumArbre")
total
To join two data frames (datasets) vertically, use the rbind() function. The two data frames must have the same variables, but they do not have to be in the same order.
data1<-paracou[1:5, 1:5]
data1
data2<-paracou[6:10, 1:5]
data2
total <- rbind(data1,data2)
total
If data frameA has variables that data frameB does not, then either:
Delete the extra variables in data frameA or
Create the additional variables in data frameB and set them to NA (missing) before joining them with rbind( ).
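A sketch of the second option with two hypothetical frames:

```r
dfA <- data.frame(x = 1:2, y = c("a", "b"), z = c(TRUE, FALSE))
dfB <- data.frame(x = 3:4, y = c("c", "d"))   # no z column

dfB$z <- NA                  # create the missing variable, filled with NA
total <- rbind(dfA, dfB)     # now both frames have the same variables
total
```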
It is relatively easy to collapse data in R using one or more BY variables and a defined function.
Aggregate data frame mtcars by cyl and vs, returning means for numeric variables
attach(mtcars)
aggdata <-aggregate(mtcars, by=list(cyl,vs),
FUN=mean, na.rm=TRUE)
print(aggdata)
## Group.1 Group.2 mpg cyl disp hp drat wt qsec vs am gear carb
## 1 4 0 26.0 4 120 91.0 4.43 2.14 16.7 0 1.000 5.00 2.00
## 2 6 0 20.6 6 155 131.7 3.81 2.75 16.3 0 1.000 4.33 4.67
## 3 8 0 15.1 8 353 209.2 3.23 4.00 16.8 0 0.143 3.29 3.50
## 4 4 1 26.7 4 104 81.8 4.04 2.30 19.4 1 0.700 4.00 1.50
## 5 6 1 19.1 6 205 115.2 3.42 3.39 19.2 1 0.000 3.50 2.50
detach(mtcars)
When using the aggregate() function, the by variables must be in a list (even if there is only one). The function can be built-in or user provided.
See also summarize( ) in the Hmisc package and summaryBy( ) in the doBy package.
R provides a variety of methods for reshaping data prior to analysis.
Use the t() function to transpose a matrix or a data frame. In the latter case, rownames become variable (column) names.
mtcars
t(mtcars)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
## mpg 21.00 21.00 22.80 21.40 18.70
## cyl 6.00 6.00 4.00 6.00 8.00
## disp 160.00 160.00 108.00 258.00 360.00
## hp 110.00 110.00 93.00 110.00 175.00
## drat 3.90 3.90 3.85 3.08 3.15
## wt 2.62 2.88 2.32 3.21 3.44
## qsec 16.46 17.02 18.61 19.44 17.02
## vs 0.00 0.00 1.00 1.00 0.00
## am 1.00 1.00 1.00 0.00 0.00
## gear 4.00 4.00 4.00 3.00 3.00
## carb 4.00 4.00 1.00 1.00 2.00
## Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
## mpg 18.10 14.30 24.40 22.80 19.20 17.80 16.40
## cyl 6.00 8.00 4.00 4.00 6.00 6.00 8.00
## disp 225.00 360.00 146.70 140.80 167.60 167.60 275.80
## hp 105.00 245.00 62.00 95.00 123.00 123.00 180.00
## drat 2.76 3.21 3.69 3.92 3.92 3.92 3.07
## wt 3.46 3.57 3.19 3.15 3.44 3.44 4.07
## qsec 20.22 15.84 20.00 22.90 18.30 18.90 17.40
## vs 1.00 0.00 1.00 1.00 1.00 1.00 0.00
## am 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## gear 3.00 3.00 4.00 4.00 4.00 4.00 3.00
## carb 1.00 4.00 2.00 2.00 4.00 4.00 3.00
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## mpg 17.30 15.20 10.40 10.40
## cyl 8.00 8.00 8.00 8.00
## disp 275.80 275.80 472.00 460.00
## hp 180.00 180.00 205.00 215.00
## drat 3.07 3.07 2.93 3.00
## wt 3.73 3.78 5.25 5.42
## qsec 17.60 18.00 17.98 17.82
## vs 0.00 0.00 0.00 0.00
## am 0.00 0.00 0.00 0.00
## gear 3.00 3.00 3.00 3.00
## carb 3.00 3.00 4.00 4.00
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
## mpg 14.70 32.40 30.40 33.90 21.50
## cyl 8.00 4.00 4.00 4.00 4.00
## disp 440.00 78.70 75.70 71.10 120.10
## hp 230.00 66.00 52.00 65.00 97.00
## drat 3.23 4.08 4.93 4.22 3.70
## wt 5.34 2.20 1.61 1.83 2.46
## qsec 17.42 19.47 18.52 19.90 20.01
## vs 0.00 1.00 1.00 1.00 1.00
## am 0.00 1.00 1.00 1.00 0.00
## gear 3.00 4.00 4.00 4.00 3.00
## carb 4.00 1.00 2.00 1.00 1.00
## Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
## mpg 15.50 15.20 13.30 19.20 27.30
## cyl 8.00 8.00 8.00 8.00 4.00
## disp 318.00 304.00 350.00 400.00 79.00
## hp 150.00 150.00 245.00 175.00 66.00
## drat 2.76 3.15 3.73 3.08 4.08
## wt 3.52 3.44 3.84 3.85 1.94
## qsec 16.87 17.30 15.41 17.05 18.90
## vs 0.00 0.00 0.00 0.00 1.00
## am 0.00 0.00 0.00 0.00 1.00
## gear 3.00 3.00 3.00 3.00 4.00
## carb 2.00 2.00 4.00 2.00 1.00
## Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
## mpg 26.00 30.40 15.80 19.70 15.00
## cyl 4.00 4.00 8.00 6.00 8.00
## disp 120.30 95.10 351.00 145.00 301.00
## hp 91.00 113.00 264.00 175.00 335.00
## drat 4.43 3.77 4.22 3.62 3.54
## wt 2.14 1.51 3.17 2.77 3.57
## qsec 16.70 16.90 14.50 15.50 14.60
## vs 0.00 1.00 0.00 0.00 0.00
## am 1.00 1.00 1.00 1.00 1.00
## gear 5.00 5.00 5.00 5.00 5.00
## carb 2.00 2.00 4.00 6.00 8.00
## Volvo 142E
## mpg 21.40
## cyl 4.00
## disp 121.00
## hp 109.00
## drat 4.11
## wt 2.78
## qsec 18.60
## vs 1.00
## am 1.00
## gear 4.00
## carb 2.00
Hadley Wickham has created a comprehensive package called reshape to massage data. Basically, you “melt” data so that each row is a unique id-variable combination. Then you “cast” the melted data into any shape you would like. Here is a very simple example.
library(reshape)
mydata
mdata <- melt(mydata, id=c("gender"))
mdata
subjmeans <- cast(mdata, gender~variable, mean)
subjmeans
There is much more that you can do with the melt( ) and cast( ) functions. See the package documentation ??reshape for more details.
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.
# select variables NomVern and Circonference
myvars <- c("NomVern", "Circonference")
newparacou <- paracou[myvars]
head(newparacou)
# select 1st and 5th through 10th variables
newparacou <- paracou[,c(1,5:10)]
head(newparacou)
# exclude a set of variables by name
myvars <- names(paracou) %in% c("XYsum", "XYmean", "XYsum2", "XYmean2", "sum3", "mean3","newname")
paracou <- paracou[!myvars]
head(paracou)
# exclude the 3rd and 5th variables
newdata <- paracou[c(-3,-5)]
head(newdata)
Selecting Observations
newdata <- paracou[1:5,]
newdata
# selection based on variable values
paracou[ which(paracou$Circonference > 37), ]
Selection using the Subset Function
The subset( ) function is the easiest way to select both variables and observations. In the following example, we select all trees with a Circonference of at least 35 and a NumArbre greater than 2970, keeping the variables NomVern, Circonference, and NumArbre.
# using the subset function
newdata <- subset(paracou, Circonference >= 35 & NumArbre > 2970,
select=c(NomVern, Circonference, NumArbre))
newdata
Use the sample( ) function to take a random sample of size n from a dataset.
cells
## [1] 1 26 24 68
sample(cells, 2, replace = F)
## [1] 24 68
sample(cells, 2, replace = F)
## [1] 68 1
**TRY IT!! Take a random sample of size 10 from the paracou dataset**
mysample <- paracou[sample(1:nrow(paracou), 10, replace=FALSE),]
mysample
Type conversions in R work as you would expect. For example, adding a character string to a numeric vector converts all the elements in the vector to character.
Use class() to test for data class.
Use the following for specific tests:
is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
Use the following to convert:
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
| | to one long vector | to matrix | to data frame |
|---|---|---|---|
| from vector | c(x,y) | cbind(x,y) or rbind(x,y) | data.frame(x,y) |
| from matrix | as.vector(mymatrix) | | as.data.frame(mymatrix) |
| from data frame | | as.matrix(myframe) | |
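A minimal sketch of implicit and explicit coercion:

```r
v <- c(1, 2, 3)
is.numeric(v)               # TRUE
v2 <- c(v, "four")          # adding a character element coerces everything
is.character(v2)            # TRUE
as.numeric(v2)              # "four" becomes NA (with a coercion warning)
```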
This section describes basic (and not so basic) statistics. It includes code for obtaining descriptive statistics, frequency counts and crosstabulations (including tests of independence), correlations (Pearson, Spearman, Kendall, polychoric), t-tests (with equal and unequal variances), nonparametric tests of group differences (Mann-Whitney U, Wilcoxon Signed-Rank, Kruskal-Wallis, and Friedman tests), multiple linear regression (including diagnostics, cross-validation, and variable selection), analysis of variance (including ANCOVA and MANOVA), and statistics based on resampling.
Since modern data analyses almost always involve graphical assessments of relationships and assumptions, links to appropriate graphical methods are provided throughout.
It is always important to check model assumptions before making statistical inferences. Although it is somewhat artificial to separate regression modeling and an ANOVA framework in this regard, many people learn these topics separately, so I’ve followed the same convention here.
Regression diagnostics cover outliers, influential observations, non-normality, non-constant error variance, multicollinearity, nonlinearity, and non-independence of errors. Classical test assumptions for ANOVA/ANCOVA/MANCOVA include the assessment of normality and homogeneity of variances in the univariate case, and multivariate normality and homogeneity of covariance matrices in the multivariate case. The identification of multivariate outliers is also considered.
Power analysis provides methods of statistical power analysis and sample size estimation for a variety of designs.
Finally, two functions that aid in efficient processing (with and by) are described.
More advanced statistical modeling can be found in the Advanced Statistics section.
R provides a wide range of functions for obtaining summary statistics. One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic.
sapply(mtcars, mean, na.rm=TRUE)
## mpg cyl disp hp drat wt qsec vs am
## 20.091 6.188 230.722 146.688 3.597 3.217 17.849 0.438 0.406
## gear carb
## 3.688 2.812
Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile. There are also numerous R functions designed to provide a range of descriptive statistics at once. For example
# mean, median, 1st and 3rd quartiles, min, max
summary(mtcars)
## mpg cyl disp hp drat
## Min. :10.4 Min. :4.00 Min. : 71 Min. : 52 Min. :2.76
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:121 1st Qu.: 96 1st Qu.:3.08
## Median :19.2 Median :6.00 Median :196 Median :123 Median :3.69
## Mean :20.1 Mean :6.19 Mean :231 Mean :147 Mean :3.60
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326 3rd Qu.:180 3rd Qu.:3.92
## Max. :33.9 Max. :8.00 Max. :472 Max. :335 Max. :4.93
## wt qsec vs am
## Min. :1.51 Min. :14.5 Min. :0.000 Min. :0.000
## 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000 1st Qu.:0.000
## Median :3.33 Median :17.7 Median :0.000 Median :0.000
## Mean :3.22 Mean :17.9 Mean :0.438 Mean :0.406
## 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :5.42 Max. :22.9 Max. :1.000 Max. :1.000
## gear carb
## Min. :3.00 Min. :1.00
## 1st Qu.:3.00 1st Qu.:2.00
## Median :4.00 Median :2.00
## Mean :3.69 Mean :2.81
## 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.00 Max. :8.00
# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(x)
## [1] 0 3 4 5 11
Using the Hmisc package
library(Hmisc)
describe(x)
## x
## n missing distinct Info Mean Gmd .05 .10
## 500 0 12 0.973 4.052 2.21 1 2
## .25 .50 .75 .90 .95
## 3 4 5 7 8
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 8 29 71 110 99 87 36 28 15 11
## Proportion 0.016 0.058 0.142 0.220 0.198 0.174 0.072 0.056 0.030 0.022
##
## Value 10 11
## Frequency 3 3
## Proportion 0.006 0.006
Using the pastecs package
library(pastecs)
stat.desc(mtcars)
Using the psych package
library(psych)
#describe(mtcars)
Summary Statistics by Group
A simple way of generating summary statistics by a grouping variable is available in the psych package.
describe.by(mtcars, mtcars$cyl)
## Warning: describe.by is deprecated. Please use the describeBy function
##
## Descriptive statistics by group
## group: 4
## vars n mean sd median trimmed mad min max range skew
## mpg 1 11 26.66 4.51 26.00 26.44 6.52 21.40 33.90 12.50 0.26
## cyl 2 11 4.00 0.00 4.00 4.00 0.00 4.00 4.00 0.00 NaN
## disp 3 11 105.14 26.87 108.00 104.30 43.00 71.10 146.70 75.60 0.12
## hp 4 11 82.64 20.93 91.00 82.67 32.62 52.00 113.00 61.00 0.01
## drat 5 11 4.07 0.37 4.08 4.02 0.34 3.69 4.93 1.24 1.00
## wt 6 11 2.29 0.57 2.20 2.27 0.54 1.51 3.19 1.68 0.30
## qsec 7 11 19.14 1.68 18.90 18.99 1.48 16.70 22.90 6.20 0.55
## vs 8 11 0.91 0.30 1.00 1.00 0.00 0.00 1.00 1.00 -2.47
## am 9 11 0.73 0.47 1.00 0.78 0.00 0.00 1.00 1.00 -0.88
## gear 10 11 4.09 0.54 4.00 4.11 0.00 3.00 5.00 2.00 0.11
## carb 11 11 1.55 0.52 2.00 1.56 0.00 1.00 2.00 1.00 -0.16
## kurtosis se
## mpg -1.65 1.36
## cyl NaN 0.00
## disp -1.64 8.10
## hp -1.71 6.31
## drat 0.12 0.11
## wt -1.36 0.17
## qsec -0.02 0.51
## vs 4.52 0.09
## am -1.31 0.14
## gear -0.01 0.16
## carb -2.15 0.16
## --------------------------------------------------------
## group: 6
## vars n mean sd median trimmed mad min max range skew
## mpg 1 7 19.74 1.45 19.70 19.74 1.93 17.80 21.40 3.60 -0.16
## cyl 2 7 6.00 0.00 6.00 6.00 0.00 6.00 6.00 0.00 NaN
## disp 3 7 183.31 41.56 167.60 183.31 11.27 145.00 258.00 113.00 0.80
## hp 4 7 122.29 24.26 110.00 122.29 7.41 105.00 175.00 70.00 1.36
## drat 5 7 3.59 0.48 3.90 3.59 0.03 2.76 3.92 1.16 -0.74
## wt 6 7 3.12 0.36 3.21 3.12 0.36 2.62 3.46 0.84 -0.22
## qsec 7 7 17.98 1.71 18.30 17.98 1.90 15.50 20.22 4.72 -0.12
## vs 8 7 0.57 0.53 1.00 0.57 0.00 0.00 1.00 1.00 -0.23
## am 9 7 0.43 0.53 0.00 0.43 0.00 0.00 1.00 1.00 0.23
## gear 10 7 3.86 0.69 4.00 3.86 0.00 3.00 5.00 2.00 0.11
## carb 11 7 3.43 1.81 4.00 3.43 0.00 1.00 6.00 5.00 -0.26
## kurtosis se
## mpg -1.91 0.55
## cyl NaN 0.00
## disp -1.23 15.71
## hp 0.25 9.17
## drat -1.40 0.18
## wt -1.98 0.13
## qsec -1.75 0.65
## vs -2.20 0.20
## am -2.20 0.20
## gear -1.24 0.26
## carb -1.50 0.69
## --------------------------------------------------------
## group: 8
## vars n mean sd median trimmed mad min max range skew
## mpg 1 14 15.10 2.56 15.20 15.15 1.56 10.40 19.20 8.80 -0.36
## cyl 2 14 8.00 0.00 8.00 8.00 0.00 8.00 8.00 0.00 NaN
## disp 3 14 353.10 67.77 350.50 349.63 73.39 275.80 472.00 196.20 0.45
## hp 4 14 209.21 50.98 192.50 203.67 44.48 150.00 335.00 185.00 0.91
## drat 5 14 3.23 0.37 3.12 3.19 0.16 2.76 4.22 1.46 1.34
## wt 6 14 4.00 0.76 3.75 3.95 0.41 3.17 5.42 2.25 0.99
## qsec 7 14 16.77 1.20 17.18 16.86 0.79 14.50 18.00 3.50 -0.80
## vs 8 14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN
## am 9 14 0.14 0.36 0.00 0.08 0.00 0.00 1.00 1.00 1.83
## gear 10 14 3.29 0.73 3.00 3.17 0.00 3.00 5.00 2.00 1.83
## carb 11 14 3.50 1.56 3.50 3.25 0.74 2.00 8.00 6.00 1.48
## kurtosis se
## mpg -0.57 0.68
## cyl NaN 0.00
## disp -1.26 18.11
## hp 0.09 13.62
## drat 1.08 0.10
## wt -0.71 0.20
## qsec -0.92 0.32
## vs NaN 0.00
## am 1.45 0.10
## gear 1.45 0.19
## carb 2.24 0.42
The doBy package provides similar capabilities. You define the desired table using a model formula and a function.
This section describes the creation of frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results.
R provides many methods for creating frequency and contingency tables. Three are described below. In the following examples, assume that A, B, and C represent categorical variables.
Table
You can generate frequency tables using the table( ) function, tables of proportions using the prop.table( ) function, and marginal frequencies using margin.table( ).
attach(mtcars)
## The following object is masked from package:ggplot2:
##
## mpg
mytable <- table(cyl,carb)
mytable
## carb
## cyl 1 2 3 4 6 8
## 4 5 6 0 0 0 0
## 6 2 0 0 4 1 0
## 8 0 4 3 6 0 1
margin.table(mytable, 1)
## cyl
## 4 6 8
## 11 7 14
margin.table(mytable, 2)
## carb
## 1 2 3 4 6 8
## 7 10 3 10 1 1
prop.table(mytable) # cell percentages
## carb
## cyl 1 2 3 4 6 8
## 4 0.1562 0.1875 0.0000 0.0000 0.0000 0.0000
## 6 0.0625 0.0000 0.0000 0.1250 0.0312 0.0000
## 8 0.0000 0.1250 0.0938 0.1875 0.0000 0.0312
prop.table(mytable, 1) # row percentages
## carb
## cyl 1 2 3 4 6 8
## 4 0.4545 0.5455 0.0000 0.0000 0.0000 0.0000
## 6 0.2857 0.0000 0.0000 0.5714 0.1429 0.0000
## 8 0.0000 0.2857 0.2143 0.4286 0.0000 0.0714
prop.table(mytable, 2) # column percentages
## carb
## cyl 1 2 3 4 6 8
## 4 0.714 0.600 0.000 0.000 0.000 0.000
## 6 0.286 0.000 0.000 0.400 1.000 0.000
## 8 0.000 0.400 1.000 0.600 0.000 1.000
table( ) can also generate multidimensional tables based on 3 or more categorical variables. In this case, use the ftable( ) function to print the results more attractively.
mytable <- table(cyl, carb, am)
ftable(mytable)
## am 0 1
## cyl carb
## 4 1 1 4
## 2 2 4
## 3 0 0
## 4 0 0
## 6 0 0
## 8 0 0
## 6 1 2 0
## 2 0 0
## 3 0 0
## 4 2 2
## 6 0 1
## 8 0 0
## 8 1 0 0
## 2 4 0
## 3 3 0
## 4 5 1
## 6 0 0
## 8 0 1
table( ) ignores missing values by default. To include NA as a category in the counts, include the table option exclude=NULL if the variable is a vector. If the variable is a factor, you have to create a new factor using newfactor <- factor(oldfactor, exclude=NULL).
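A minimal sketch of the exclude=NULL behavior on a character vector:

```r
x <- c("a", "b", "a", NA)
table(x)                    # NA is dropped by default: a = 2, b = 1
table(x, exclude = NULL)    # NA appears as its own category
```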
xtabs
The xtabs( ) function allows you to create crosstabulations using formula style input.
mytable <- xtabs(~cyl+carb+am, data=mydata)
ftable(mytable) # print table
## am 0 1
## cyl carb
## 4 1 1 4
## 2 2 4
## 3 0 0
## 4 0 0
## 6 0 0
## 8 0 0
## 6 1 2 0
## 2 0 0
## 3 0 0
## 4 2 2
## 6 0 1
## 8 0 0
## 8 1 0 0
## 2 4 0
## 3 3 0
## 4 5 1
## 6 0 0
## 8 0 1
summary(mytable) # chi-square test of independence
## Call: xtabs(formula = ~cyl + carb + am, data = mydata)
## Number of cases in table: 32
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 55, df = 27, p-value = 0.001
## Chi-squared approximation may be incorrect
If a variable is included on the left side of the formula, it is assumed to be a vector of frequencies (useful if the data have already been tabulated).
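For example, with hypothetical pre-tabulated counts (the counts data frame below is invented for illustration):

```r
# pre-tabulated data: one row per cell, with a Freq column of counts
counts <- data.frame(species = c("oak", "oak", "pine"),
                     plot    = c("A", "B", "A"),
                     Freq    = c(10, 5, 7))

# Freq on the left side is treated as a vector of frequencies
mytab <- xtabs(Freq ~ species + plot, data = counts)
mytab
```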
For 2-way tables, you can use chisq.test(mytable) to test the independence of the row and column variables. By default, the p-value is calculated from the asymptotic chi-squared distribution of the test statistic. Optionally, the p-value can be derived via Monte Carlo simulation.
mytable <- table(cyl,carb)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 20, df = 10, p-value = 0.007
fisher.test(x) provides an exact test of independence, where x is a two-dimensional contingency table in matrix form.
mytable <- table(cyl,carb)
fisher.test(mytable)
##
## Fisher's Exact Test for Count Data
##
## data: mytable
## p-value = 3e-04
## alternative hypothesis: two.sided
Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null hypothesis that two nominal variables are conditionally independent in each stratum, assuming that there is no three-way interaction. Here x is a three-dimensional contingency table, where the last dimension refers to the strata.
You can use the loglm( ) function in the MASS package to produce log-linear models. For example, let’s assume we have a 3-way contingency table based on variables cyl, carb, and am. We can perform the following tests:
library(MASS)
mytable <- xtabs(~cyl+carb+am, data=mtcars)
#Mutual Independence: cyl, carb, and am are pairwise independent.
loglm(~cyl+carb+am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am, data = mytable)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 52.3 27 0.00247
## Pearson 55.2 27 0.00107
#Partial Independence: cyl is partially independent of carb and am
#(i.e., cyl is independent of the composite variable carbam)
loglm(~cyl+carb+am+carb*am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am + carb * am, data = mytable)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 44.3 22 0.00329
## Pearson NaN 22 NaN
#Conditional Independence: cyl is independent of carb, given am.
loglm(~cyl+carb+am+cyl*am+carb*am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am + cyl * am + carb * am, data = mytable)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 35 20 0.0203
## Pearson NaN 20 NaN
#No Three-Way Interaction
loglm(~cyl+carb+am+cyl*carb+cyl*am+carb*am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am + cyl * carb + cyl * am + carb *
## am, data = mytable)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 0.327 10 1
## Pearson NaN 10 NaN
The assocstats(mytable) function in the vcd package calculates the phi coefficient, contingency coefficient, and Cramer's V for an r x c table. The Kappa(mytable) function in the same package calculates Cohen's kappa and weighted kappa for a confusion matrix.
You can use the cor( ) function to produce correlations and the cov( ) function to produce covariances.
A simplified format is cor(x, use=, method= ) where

- x is a matrix or data frame
- use specifies the handling of missing data. Options are all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion)
- method specifies the type of correlation. Options are pearson, spearman, or kendall
# Correlations/covariances among numeric variables in
# data frame mtcars. Use listwise deletion of missing data.
cor(mtcars, use="complete.obs", method="kendall")
## mpg cyl disp hp drat wt qsec vs am
## mpg 1.000 -0.795 -0.768 -0.743 0.4645 -0.728 0.3154 0.590 0.4690
## cyl -0.795 1.000 0.814 0.785 -0.5513 0.728 -0.4490 -0.771 -0.4946
## disp -0.768 0.814 1.000 0.666 -0.4990 0.743 -0.3008 -0.603 -0.5203
## hp -0.743 0.785 0.666 1.000 -0.3826 0.611 -0.4729 -0.631 -0.3040
## drat 0.465 -0.551 -0.499 -0.383 1.0000 -0.547 0.0327 0.375 0.5755
## wt -0.728 0.728 0.743 0.611 -0.5471 1.000 -0.1420 -0.488 -0.6139
## qsec 0.315 -0.449 -0.301 -0.473 0.0327 -0.142 1.0000 0.658 -0.1689
## vs 0.590 -0.771 -0.603 -0.631 0.3751 -0.488 0.6575 1.000 0.1683
## am 0.469 -0.495 -0.520 -0.304 0.5755 -0.614 -0.1689 0.168 1.0000
## gear 0.433 -0.513 -0.476 -0.279 0.5839 -0.544 -0.0913 0.270 0.7708
## carb -0.504 0.465 0.414 0.596 -0.0954 0.371 -0.5064 -0.577 -0.0586
## gear carb
## mpg 0.4332 -0.5044
## cyl -0.5125 0.4654
## disp -0.4760 0.4137
## hp -0.2794 0.5960
## drat 0.5839 -0.0954
## wt -0.5436 0.3714
## qsec -0.0913 -0.5064
## vs 0.2697 -0.5769
## am 0.7708 -0.0586
## gear 1.0000 0.0980
## carb 0.0980 1.0000
cov(mtcars, use="complete.obs")
## mpg cyl disp hp drat wt qsec vs
## mpg 36.32 -9.172 -633.1 -320.73 2.1951 -5.117 4.5091 2.0171
## cyl -9.17 3.190 199.7 101.93 -0.6684 1.367 -1.8869 -0.7298
## disp -633.10 199.660 15360.8 6721.16 -47.0640 107.684 -96.0517 -44.3776
## hp -320.73 101.931 6721.2 4700.87 -16.4511 44.193 -86.7701 -24.9879
## drat 2.20 -0.668 -47.1 -16.45 0.2859 -0.373 0.0871 0.1186
## wt -5.12 1.367 107.7 44.19 -0.3727 0.957 -0.3055 -0.2737
## qsec 4.51 -1.887 -96.1 -86.77 0.0871 -0.305 3.1932 0.6706
## vs 2.02 -0.730 -44.4 -24.99 0.1186 -0.274 0.6706 0.2540
## am 1.80 -0.466 -36.6 -8.32 0.1902 -0.338 -0.2050 0.0423
## gear 2.14 -0.649 -50.8 -6.36 0.2760 -0.421 -0.2804 0.0766
## carb -5.36 1.520 79.1 83.04 -0.0784 0.676 -1.8941 -0.4637
## am gear carb
## mpg 1.8039 2.1357 -5.3631
## cyl -0.4657 -0.6492 1.5202
## disp -36.5640 -50.8026 79.0687
## hp -8.3206 -6.3589 83.0363
## drat 0.1902 0.2760 -0.0784
## wt -0.3381 -0.4211 0.6758
## qsec -0.2050 -0.2804 -1.8941
## vs 0.0423 0.0766 -0.4637
## am 0.2490 0.2923 0.0464
## gear 0.2923 0.5444 0.3266
## carb 0.0464 0.3266 2.6089
Unfortunately, neither cor( ) nor cov( ) produces tests of significance, although you can use the cor.test( ) function to test a single correlation coefficient.
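As a quick illustration, cor.test( ) on two mtcars variables:

```r
# test the correlation between mpg and wt
ct <- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate   # about -0.87
ct$p.value    # far below 0.05: reject independence
```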
The rcorr( ) function in the Hmisc package produces correlations/covariances and significance levels for pearson and spearman correlations. However, input must be a matrix and pairwise deletion is used.
# Correlations with significance levels
rcorr(as.matrix(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
## am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
##
## n= 32
##
##
## P
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 0.0000 0.0000 0.0000 0.0000 0.0000 0.0171 0.0000 0.0003 0.0054
## cyl 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0000 0.0022 0.0042
## disp 0.0000 0.0000 0.0000 0.0000 0.0000 0.0131 0.0000 0.0004 0.0010
## hp 0.0000 0.0000 0.0000 0.0100 0.0000 0.0000 0.0000 0.1798 0.4930
## drat 0.0000 0.0000 0.0000 0.0100 0.0000 0.6196 0.0117 0.0000 0.0000
## wt 0.0000 0.0000 0.0000 0.0000 0.0000 0.3389 0.0010 0.0000 0.0005
## qsec 0.0171 0.0004 0.0131 0.0000 0.6196 0.3389 0.0000 0.2057 0.2425
## vs 0.0000 0.0000 0.0000 0.0000 0.0117 0.0010 0.0000 0.3570 0.2579
## am 0.0003 0.0022 0.0004 0.1798 0.0000 0.0000 0.2057 0.3570 0.0000
## gear 0.0054 0.0042 0.0010 0.4930 0.0000 0.0005 0.2425 0.2579 0.0000
## carb 0.0011 0.0019 0.0253 0.0000 0.6212 0.0146 0.0000 0.0007 0.7545 0.1290
## carb
## mpg 0.0011
## cyl 0.0019
## disp 0.0253
## hp 0.0000
## drat 0.6212
## wt 0.0146
## qsec 0.0000
## vs 0.0007
## am 0.7545
## gear 0.1290
## carb
You can use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the columns of Y.
# Correlation matrix from mtcars
# with mpg, cyl, and disp as rows
# and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
## hp drat wt
## mpg -0.776 0.681 -0.868
## cyl 0.832 -0.700 0.782
## disp 0.791 -0.710 0.888
The t.test( ) function produces a variety of t-tests. Unlike most statistical packages, the default assumes unequal variances and applies the Welch df modification.
t.test(y~x), where y is numeric and x is a binary factor
# independent 2-group t-test
t.test(rnorm(10,0,1),rnorm(10,1,1)) # where y1 and y2 are numeric
##
## Welch Two Sample t-test
##
## data: rnorm(10, 0, 1) and rnorm(10, 1, 1)
## t = -4, df = 20, p-value = 6e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.605 -0.858
## sample estimates:
## mean of x mean of y
## -0.211 1.521
# paired t-test
t.test(rnorm(10,0,1),rnorm(10,1,1),paired=TRUE) # where y1 & y2 are numeric
##
## Paired t-test
##
## data: rnorm(10, 0, 1) and rnorm(10, 1, 1)
## t = -2, df = 9, p-value = 0.06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.0434 0.0693
## sample estimates:
## mean of the differences
## -0.987
t.test(y,mu=3) # Ho: mu=3
##
## One Sample t-test
##
## data: y
## t = 6, df = 100, p-value = 3e-08
## alternative hypothesis: true mean is not equal to 3
## 95 percent confidence interval:
## 35.3 67.0
## sample estimates:
## mean of x
## 51.2
You can use the var.equal=TRUE option to specify equal variances and a pooled variance estimate. You can use the alternative="less" or alternative="greater" option to specify a one-tailed test.
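A short sketch of both options on simulated data (set.seed is added here only for reproducibility):

```r
set.seed(123)
y1 <- rnorm(10, 0, 1)
y2 <- rnorm(10, 1, 1)

# pooled-variance (classical) two-sample t-test: df = n1 + n2 - 2 = 18
t.test(y1, y2, var.equal = TRUE)

# one-tailed test: Ha is that mean(y1) < mean(y2)
t.test(y1, y2, alternative = "less")
```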
Nonparametric and resampling alternatives to t-tests are available.
R provides functions for carrying out Mann-Whitney U, Wilcoxon Signed Rank, Kruskal Wallis, and Friedman tests.
wilcox.test(paracou$Circonference~rbinom(19,1,0.5))
## Warning in wilcox.test.default(x = c(36, 33.5, 32.5, 34.5, 33, 32.5, 36, :
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: paracou$Circonference by rbinom(19, 1, 0.5)
## W = 30, p-value = 0.4
## alternative hypothesis: true location shift is not equal to 0
# where paracou$Circonference is numeric and rbinom(19,1,0.5) is a binary grouping variable
wilcox.test(paracou$Circonference[1:9],paracou$Circonference[11:19])
## Warning in wilcox.test.default(paracou$Circonference[1:9], paracou
## $Circonference[11:19]): cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: paracou$Circonference[1:9] and paracou$Circonference[11:19]
## W = 40, p-value = 0.9
## alternative hypothesis: true location shift is not equal to 0
# where y and x are numeric
wilcox.test(paracou$Circonference[1:9],paracou$Circonference[11:19],paired=TRUE)
## Warning in wilcox.test.default(paracou$Circonference[1:9], paracou
## $Circonference[11:19], : cannot compute exact p-value with ties
## Warning in wilcox.test.default(paracou$Circonference[1:9], paracou
## $Circonference[11:19], : cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: paracou$Circonference[1:9] and paracou$Circonference[11:19]
## V = 20, p-value = 0.9
## alternative hypothesis: true location shift is not equal to 0
# where y and x are numeric
kruskal.test(mpg~gear)
##
## Kruskal-Wallis rank sum test
##
## data: mpg by gear
## Kruskal-Wallis chi-squared = 10, df = 2, p-value = 8e-04
# where y is numeric and x is a factor
For the wilcox.test you can use the alternative="less" or alternative="greater" option to specify a one-tailed test.
Parametric and resampling alternatives are available.
R provides comprehensive support for multiple linear regression. The topics below are provided in order of increasing complexity.
Multiple Linear Regression Example
fit <- lm(Circonference ~ X + Y + NumArbre, data=paracou)
summary(fit) # show results
##
## Call:
## lm(formula = Circonference ~ X + Y + NumArbre, data = paracou)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.249 -1.314 -0.465 1.398 3.365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -219.3771 283.2475 -0.77 0.45
## X -0.0182 0.0120 -1.51 0.15
## Y -0.0112 0.0199 -0.57 0.58
## NumArbre 0.0864 0.0962 0.90 0.38
##
## Residual standard error: 1.88 on 15 degrees of freedom
## Multiple R-squared: 0.168, Adjusted R-squared: 0.00215
## F-statistic: 1.01 on 3 and 15 DF, p-value: 0.414
Other useful functions
coefficients(fit) # model coefficients
## (Intercept) X Y NumArbre
## -219.3771 -0.0182 -0.0112 0.0864
confint(fit, level=0.95) # CIs for model parameters
## 2.5 % 97.5 %
## (Intercept) -823.1050 3.84e+02
## X -0.0438 7.45e-03
## Y -0.0536 3.11e-02
## NumArbre -0.1186 2.91e-01
fitted(fit) # predicted values
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 35.0 35.1 34.2 34.6 33.8 34.0 33.8 33.2 32.9 32.9 33.5 34.3 33.5 32.0 34.6
## 16 17 18 19
## 34.2 34.3 34.1 34.0
residuals(fit) # residuals
## 1 2 3 4 5 6 7 8 9
## 1.0465 -1.1168 -2.2492 3.3655 -1.3026 -0.4646 -1.3263 1.2678 0.0791
## 10 11 12 13 14 15 16 17 18
## -0.8847 -0.9792 1.6668 1.0293 1.5289 2.3893 -2.1566 -1.7810 1.8901
## 19
## -2.0024
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
## (Intercept) X Y NumArbre
## (Intercept) 80229.173 2.07e-01 3.11e+00 -2.72e+01
## X 0.207 1.45e-04 -2.43e-05 -7.11e-05
## Y 3.112 -2.43e-05 3.94e-04 -1.07e-03
## NumArbre -27.244 -7.11e-05 -1.07e-03 9.25e-03
influence(fit) # regression diagnostics
## $hat
## 1 2 3 4 5 6 7 8 9 10
## 0.2470 0.2530 0.1529 0.1424 0.1187 0.1509 0.1714 0.1396 0.1906 0.2172
## 11 12 13 14 15 16 17 18 19
## 0.3536 0.2376 0.0802 0.4502 0.2135 0.1424 0.2107 0.3122 0.2159
##
## $coefficients
## (Intercept) X Y NumArbre
## 1 -77.833 -0.001882 -9.41e-04 2.64e-02
## 2 75.938 0.002738 7.64e-04 -2.57e-02
## 3 122.275 -0.000412 3.33e-03 -4.15e-02
## 4 -148.189 -0.003792 -1.90e-03 5.02e-02
## 5 19.669 -0.000272 -2.22e-03 -6.53e-03
## 6 0.273 0.000317 -1.48e-03 -2.08e-05
## 7 -15.853 0.000864 -5.35e-03 5.62e-03
## 8 17.170 0.001344 3.55e-03 -6.00e-03
## 9 -0.387 0.000230 -6.52e-05 1.32e-04
## 10 4.063 -0.002865 1.50e-03 -1.43e-03
## 11 42.477 -0.003280 7.38e-03 -1.47e-02
## 12 -54.785 0.000294 -9.72e-03 1.91e-02
## 13 18.791 0.000777 -1.11e-04 -6.33e-03
## 14 174.637 0.007248 1.19e-02 -5.97e-02
## 15 -0.911 -0.003466 -8.73e-03 9.33e-04
## 16 -70.399 0.002599 1.20e-04 2.36e-02
## 17 -89.494 0.003756 -2.19e-03 3.02e-02
## 18 163.381 -0.005232 7.87e-03 -5.54e-02
## 19 -97.342 0.001857 2.13e-03 3.26e-02
##
## $sigma
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1.92 1.92 1.83 1.69 1.91 1.94 1.91 1.91 1.95 1.93 1.92 1.88 1.93 1.87 1.81
## 16 17 18 19
## 1.84 1.87 1.85 1.85
##
## $wt.res
## 1 2 3 4 5 6 7 8 9
## 1.0465 -1.1168 -2.2492 3.3655 -1.3026 -0.4646 -1.3263 1.2678 0.0791
## 10 11 12 13 14 15 16 17 18
## -0.8847 -0.9792 1.6668 1.0293 1.5289 2.3893 -2.1566 -1.7810 1.8901
## 19
## -2.0024
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
You can compare nested models with the anova( ) function.
# compare models
fit1 <- lm(Circonference ~ X + Y + NumArbre, data=paracou)
fit2 <- lm(Circonference ~ X + Y , data=paracou)
anova(fit1, fit2)
You can do K-fold cross-validation using the CVlm( ) function in the DAAG package.
library(DAAG)
CVlm(data=nihills, form.lm=formula(log(time)~log(climb)+log(dist)),
plotit="Observed")
## Analysis of Variance Table
##
## Response: log(time)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(climb) 1 5.94 5.94 1013 < 2e-16 ***
## log(dist) 1 0.89 0.89 152 8.2e-11 ***
## Residuals 20 0.12 0.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## fold 1
## Observations in test set: 7
## Binevenagh Slieve Gullion McVeigh Classic Slieve Martin
## Predicted -0.1129 -0.7174 -0.5623 -0.5310
## cvpred -0.0735 -0.6986 -0.5336 -0.5337
## log(time) -0.1528 -0.7621 -0.6141 -0.5968
## CV residual -0.0793 -0.0636 -0.0805 -0.0631
## Hen & Cock Slieve Bearnagh Lurig Challenge
## Predicted -0.931 -0.33834 -0.7992
## cvpred -0.975 -0.38052 -0.7779
## log(time) -0.799 -0.37429 -0.8330
## CV residual 0.176 0.00623 -0.0551
##
## Sum of squares = 0.05 Mean square = 0.01 n = 7
##
## fold 2
## Observations in test set: 8
## Donard & Commedagh Annalong Horseshoe Monument Race Rocky
## Predicted 0.1184 0.7063 -0.7992 -0.6770
## cvpred 0.1319 0.7327 -0.8194 -0.6900
## log(time) 0.0379 0.6674 -0.7515 -0.6481
## CV residual -0.0940 -0.0653 0.0679 0.0419
## Slieve Donard Flagstaff to Carling Slieve Gallion
## Predicted -0.1044 0.4017 -0.5342
## cvpred -0.0958 0.4135 -0.5441
## log(time) -0.0530 0.3763 -0.4524
## CV residual 0.0428 -0.0372 0.0917
## BARF Turkey Trot
## Predicted -0.3913
## cvpred -0.4010
## log(time) -0.3382
## CV residual 0.0628
##
## Sum of squares = 0.04 Mean square = 0 n = 8
##
## fold 3
## Observations in test set: 8
## Glenariff Mountain Tollymore Mountain Moughanmore
## Predicted -0.446 -0.6989 -0.80639
## cvpred -0.462 -0.7020 -0.76628
## log(time) -0.352 -0.7270 -0.76871
## CV residual 0.110 -0.0251 -0.00243
## Loughshannagh Horseshoe Meelbeg Meelmore Donard Forest
## Predicted -0.5028 -0.616 -0.562
## cvpred -0.4897 -0.586 -0.555
## log(time) -0.4355 -0.789 -0.657
## CV residual 0.0542 -0.202 -0.101
## Seven Sevens Scrabo Hill Race
## Predicted 1.270 -1.15233
## cvpred 1.187 -1.11845
## log(time) 1.362 -1.12479
## CV residual 0.175 -0.00634
##
## Sum of squares = 0.1 Mean square = 0.01 n = 8
##
## Overall (Sum over all 8 folds)
## ms
## 0.00815
Sum the squared cross-validation residuals over the folds, divide by the number of observations, and take the square root to get the cross-validated standard error of estimate.
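A sketch of that computation, using the per-fold "Sum of squares" values printed above (0.05, 0.04, and 0.1; rounded in the output) as assumed inputs:

```r
# Cross-validated standard error of estimate (sketch; the fold sums of
# squared CV residuals are read off the rounded CVlm() output above)
ss.folds <- c(0.05, 0.04, 0.1)
n <- 23                           # total number of observations
cv.se <- sqrt(sum(ss.folds) / n)
```

Because the printed sums of squares are rounded, cv.se only approximates the square root of the overall mean square reported by CVlm( ).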
You can assess R2 shrinkage via K-fold cross-validation using the crossval() function from the bootstrap package.
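A minimal sketch of that approach (the fit/predict helper functions and the choice of predictors are illustrative, not from the original text):

```r
library(bootstrap)
# crossval() requires a fitting function and a prediction function
theta.fit <- function(x, y) lsfit(x, y)
theta.predict <- function(fit, x) cbind(1, x) %*% fit$coef
X <- as.matrix(mtcars[, c("hp", "wt", "drat")])  # illustrative predictors
y <- mtcars$mpg
results <- crossval(X, y, theta.fit, theta.predict, ngroup = 10)
raw.r2 <- cor(y, lm(y ~ X)$fitted.values)^2  # raw R-squared
cv.r2  <- cor(y, results$cv.fit)^2           # cross-validated R-squared
raw.r2 - cv.r2                               # shrinkage
```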
Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package. stepAIC( ) performs stepwise model selection by exact AIC.
# Stepwise Regression
step <- stepAIC(fit, direction="both")
## Start: AIC=27.5
## Circonference ~ X + Y + NumArbre
##
## Df Sum of Sq RSS AIC
## - Y 1 1.13 54.2 25.9
## - NumArbre 1 2.85 55.9 26.5
## <none> 53.0 27.5
## - X 1 8.08 61.1 28.2
##
## Step: AIC=25.9
## Circonference ~ X + NumArbre
##
## Df Sum of Sq RSS AIC
## - NumArbre 1 1.74 55.9 24.5
## <none> 54.2 25.9
## - X 1 8.80 63.0 26.8
## + Y 1 1.13 53.0 27.5
##
## Step: AIC=24.5
## Circonference ~ X
##
## Df Sum of Sq RSS AIC
## <none> 55.9 24.5
## - X 1 7.87 63.8 25.0
## + NumArbre 1 1.74 54.2 25.9
## + Y 1 0.02 55.9 26.5
step$anova # display results
Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps package.
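For instance (the predictor choice is illustrative), the best subset by Mallows' Cp can be extracted as follows:

```r
library(leaps)
# All-subsets regression; leaps() takes a predictor matrix and a response
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "drat")])
out <- leaps(X, mtcars$mpg, method = "Cp", names = colnames(X))
out$which[which.min(out$Cp), ]  # TRUE marks predictors in the best-Cp subset
```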
The relaimpo package provides measures of relative importance for each of the predictors in the model.
library(relaimpo)
# Calculate Relative Importance for Each Predictor
calc.relimp(fit,type=c("lmg","last","first","pratt"),
rela=TRUE)
## Response variable: Circonference
## Total response variance: 3.54
## Analysis based on 19 observations
##
## 3 Regressors:
## X Y NumArbre
## Proportion of variance explained by model: 16.9%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg last first pratt
## X 0.7482 0.670 0.8703 0.7540
## Y 0.0755 0.094 0.0401 0.0731
## NumArbre 0.1763 0.236 0.0896 0.1729
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs
## X -0.01766 -0.01819 -0.0182
## Y -0.00519 -0.00777 -0.0112
## NumArbre 0.03772 0.06664 0.0864
# Bootstrap Measures of Relative Importance (!! 100 samples)
boot <- boot.relimp(fit, b = 100, type = c("lmg",
"last", "first", "pratt"), rank = TRUE,
diff = TRUE, rela = TRUE)
booteval.relimp(boot) # print result
## Response variable: Circonference
## Total response variance: 3.54
## Analysis based on 19 observations
##
## 3 Regressors:
## X Y NumArbre
## Proportion of variance explained by model: 16.9%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg last first pratt
## X 0.7482 0.670 0.8703 0.7540
## Y 0.0755 0.094 0.0401 0.0731
## NumArbre 0.1763 0.236 0.0896 0.1729
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs
## X -0.01766 -0.01819 -0.0182
## Y -0.00519 -0.00777 -0.0112
## NumArbre 0.03772 0.06664 0.0864
##
##
## Confidence interval information ( 100 bootstrap replicates, bty= perc ):
## Relative Contributions with confidence intervals:
##
## Lower Upper
## percentage 0.95 0.95 0.95
## X.lmg 0.7482 ABC 0.0254 0.9590
## Y.lmg 0.0755 ABC 0.0144 0.6240
## NumArbre.lmg 0.1763 ABC 0.0068 0.8280
##
## X.last 0.6700 ABC 0.0015 0.9680
## Y.last 0.0940 ABC 0.0002 0.6900
## NumArbre.last 0.2360 ABC 0.0063 0.9520
##
## X.first 0.8703 ABC 0.0046 0.9910
## Y.first 0.0401 ABC 0.0002 0.6660
## NumArbre.first 0.0896 ABC 0.0001 0.8870
##
## X.pratt 0.7540 ABC -0.2120 1.0150
## Y.pratt 0.0731 ABC -0.1400 0.8290
## NumArbre.pratt 0.1729 ABC -0.1300 1.0700
##
## Letters indicate the ranks covered by bootstrap CIs.
## (Rank bootstrap confidence intervals always obtained by percentile method)
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
##
##
## Differences between Relative Contributions:
##
## Lower Upper
## difference 0.95 0.95 0.95
## X-Y.lmg 0.6727 -0.5386 0.9374
## X-NumArbre.lmg 0.5718 -0.7951 0.9524
## Y-NumArbre.lmg -0.1008 -0.7251 0.4180
##
## X-Y.last 0.5756 -0.6131 0.9610
## X-NumArbre.last 0.4332 -0.9397 0.9499
## Y-NumArbre.last -0.1424 -0.9157 0.5406
##
## X-Y.first 0.8302 -0.5933 0.9863
## X-NumArbre.first 0.7806 -0.8572 0.9871
## Y-NumArbre.first -0.0496 -0.8419 0.5868
##
## X-Y.pratt 0.6809 -0.8180 1.0218
## X-NumArbre.pratt 0.5811 -1.1032 1.0431
## Y-NumArbre.pratt -0.0997 -1.1677 0.8540
##
## * indicates that CI for difference does not include 0.
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
plot(booteval.relimp(boot,sort=TRUE)) # plot result
An excellent review of regression diagnostics is provided in the vignette of the car package, which provides advanced utilities for regression modeling.
Assume that we are fitting a multiple linear regression on the mtcars data.
library(car)
fit <- lm(mpg~disp+hp+wt+drat, data=mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.508 -1.905 -0.506 0.982 5.688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.14874 6.29359 4.63 8.2e-05 ***
## disp 0.00382 0.01080 0.35 0.7268
## hp -0.03478 0.01160 -3.00 0.0058 **
## wt -3.47967 1.07837 -3.23 0.0033 **
## drat 1.76805 1.31978 1.34 0.1915
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6 on 27 degrees of freedom
## Multiple R-squared: 0.838, Adjusted R-squared: 0.814
## F-statistic: 34.8 on 4 and 27 DF, p-value: 2.7e-10
This example is for exposition only. We will ignore the fact that this may not be a great way of modeling this particular data set!
# Assessing Outliers
outlierTest(fit) # Bonferroni p-value for most extreme obs
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## Toyota Corolla 2.52 0.0184 0.588
qqPlot(fit, main="QQ Plot") # qq plot for studentized resid
leveragePlots(fit) # leverage plots
# Influential Observations
# added variable plots
avPlots(fit) # av.plots( ) is defunct in recent versions of car
# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(mtcars)-length(fit$coefficients)-2))
plot(fit, which=4, cook.levels=cutoff)
# Influence Plot
influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance" )
# Normality of Residuals, qq plot for studentized resid
qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
library(MASS) # studres( ) comes from the MASS package
sresid <- studres(fit)
hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.43 Df = 1 p = 0.232
# plot studentized residuals vs. fitted values
spreadLevelPlot(fit)
##
## Suggested power transformation: 0.662
# Evaluate Collinearity
vif(fit) # variance inflation factors
## disp hp wt drat
## 8.21 2.89 5.10 2.28
sqrt(vif(fit)) > 2 # problem?
## disp hp wt drat
## TRUE FALSE TRUE FALSE
# Evaluate Nonlinearity
# component + residual plot
crPlots(fit)
# Ceres plots
ceresPlots(fit)
Ceres plots are a generalization of component+residual (partial residual) plots that are less prone to leakage of nonlinearity among the predictors.
# Test for Autocorrelated Errors
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 0.101 1.74 0.308
## Alternative hypothesis: rho != 0
The gvlma( ) function in the gvlma package performs a global validation of linear model assumptions as well as separate evaluations of skewness, kurtosis, and heteroscedasticity.
# Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.508 -1.905 -0.506 0.982 5.688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.14874 6.29359 4.63 8.2e-05 ***
## disp 0.00382 0.01080 0.35 0.7268
## hp -0.03478 0.01160 -3.00 0.0058 **
## wt -3.47967 1.07837 -3.23 0.0033 **
## drat 1.76805 1.31978 1.34 0.1915
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6 on 27 degrees of freedom
## Multiple R-squared: 0.838, Adjusted R-squared: 0.814
## F-statistic: 34.8 on 4 and 27 DF, p-value: 2.7e-10
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit)
##
## Value p-value Decision
## Global Stat 13.9382 0.00750 Assumptions NOT satisfied!
## Skewness 4.3131 0.03782 Assumptions NOT satisfied!
## Kurtosis 0.0138 0.90654 Assumptions acceptable.
## Link Function 8.7166 0.00315 Assumptions NOT satisfied!
## Heteroscedasticity 0.8947 0.34421 Assumptions acceptable.
If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R’s approach less coherent and user-friendly. A good online presentation on ANOVA in R can be found in the ANOVA section of the Personality Project. (Note: I have found that these pages render fine in Chrome and Safari browsers, but can appear distorted in Internet Explorer.)
str(npk)
## 'data.frame': 24 obs. of 5 variables:
## $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ N : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
## $ P : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
## $ K : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
## $ yield: num 49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
summary(aov(yield ~ N, npk))
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189 189.3 6.06 0.022 *
## Residuals 22 687 31.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
str(npk)
## 'data.frame': 24 obs. of 5 variables:
## $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ N : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
## $ P : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
## $ K : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
## $ yield: num 49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
summary(aov(yield ~ N + block, npk))
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189 189.3 9.36 0.0071 **
## block 5 343 68.7 3.40 0.0262 *
## Residuals 17 344 20.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(yield ~ N + P + N:P, npk))
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189 189.3 5.76 0.026 *
## P 1 8 8.4 0.26 0.619
## N:P 1 21 21.3 0.65 0.430
## Residuals 20 657 32.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(yield ~ N*P, npk)) # same thing
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189 189.3 5.76 0.026 *
## P 1 8 8.4 0.26 0.619
## N:P 1 21 21.3 0.65 0.430
## Residuals 20 657 32.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(yield ~ N + rnorm(24, 0, 1), npk))
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189 189.3 6.25 0.021 *
## rnorm(24, 0, 1) 1 51 50.8 1.68 0.209
## Residuals 21 636 30.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
layout(matrix(c(1,2,3,4),2,2)) # optional layout
plot(aov(yield ~ N, npk)) # diagnostic plots
summary(aov(yield ~ N + P + N:P, npk)) # display Type I ANOVA table
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189 189.3 5.76 0.026 *
## P 1 8 8.4 0.26 0.619
## N:P 1 21 21.3 0.65 0.430
## Residuals 20 657 32.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(aov(yield ~ N + P + N:P, npk),~.,test="F") # type III SS and F Tests
Nonparametric and resampling alternatives are available.
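For example, the Kruskal-Wallis rank sum test is a common nonparametric alternative to a one-way ANOVA (shown here on the npk data used above):

```r
# Nonparametric comparison of yield across nitrogen levels
kruskal.test(yield ~ N, data = npk)
```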
You can get Tukey HSD tests using the TukeyHSD( ) function. By default, it calculates post hoc comparisons on each factor in the model. You can restrict it to specific factors with the which= option.
# Tukey Honestly Significant Differences
TukeyHSD(aov(yield ~ N + P + N:P, npk)) # where fit comes from aov()
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = yield ~ N + P + N:P, data = npk)
##
## $N
## diff lwr upr p adj
## 1-0 5.62 0.734 10.5 0.026
##
## $P
## diff lwr upr p adj
## 1-0 -1.18 -6.07 3.7 0.619
##
## $`N:P`
## diff lwr upr p adj
## 1:0-0:0 7.50 -1.76 16.76 0.140
## 0:1-0:0 0.70 -8.56 9.96 0.997
## 1:1-0:0 4.43 -4.83 13.70 0.550
## 0:1-1:0 -6.80 -16.06 2.46 0.202
## 1:1-1:0 -3.07 -12.33 6.20 0.791
## 1:1-0:1 3.73 -5.53 13.00 0.677
# Two-way Interaction Plot
attach(mtcars)
gear <- factor(gear)
cyl <- factor(cyl)
interaction.plot(cyl, gear, mpg, type="b", col=c(1:3),
leg.bty="o", leg.bg="beige", lwd=2, pch=c(18,24,22),
xlab="Number of Cylinders",
ylab="Mean Miles Per Gallon",
main="Interaction Plot")
# Plot Means with Error Bars
library(gplots)
attach(mtcars)
cyl <- factor(cyl)
plotmeans(mpg~cyl,xlab="Number of Cylinders",
ylab="Miles Per Gallon", main="Mean Plot\nwith 95% CI")
One of the main reasons data analysts turn to R is for its strong graphic capabilities.
Creating a Graph provides an overview of creating and saving graphs in R.
The remainder of the section describes how to create basic graph types. These include density plots (histograms and kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots) and Scatterplots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots).
The Advanced Graphs section describes how to customize and annotate graphs, and covers more statistically complex types of graphs.
In R, graphs are typically created interactively.
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
The plot( ) function opens a graph window and plots weight vs. miles per gallon. The next line of code adds a regression line to this graph. The final line adds a title.
You can save the graph in a variety of formats from the menu File -> Save As.
You can also save the graph via code using one of the following functions.
| Function | Output to |
|---|---|
| pdf(“mygraph.pdf”) | pdf file |
| win.metafile(“mygraph.wmf”) | windows metafile |
| png(“mygraph.png”) | png file |
| jpeg(“mygraph.jpg”) | jpeg file |
| bmp(“mygraph.bmp”) | bmp file |
| postscript(“mygraph.ps”) | postscript file |
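Each of these functions redirects subsequent plotting commands to a file; call dev.off( ) when you are done to close the file. For example:

```r
# Send a plot to a PDF file instead of the screen
pdf("mygraph.pdf")
plot(mtcars$wt, mtcars$mpg)
dev.off()  # close the file so it can be viewed
```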
Creating a new graph by issuing a high level plotting command (plot, hist, boxplot, etc.) will typically overwrite a previous graph. To avoid this, open a new graph window before creating a new graph. To open a new graph window use one of the functions below.
| Function | Platform |
|---|---|
| windows() | Windows |
| X11() | Unix |
| quartz() | Mac |
You can have multiple graph windows open at one time. See help(dev.cur) for more details.
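A portable alternative is dev.new( ), which opens whichever device type suits the current platform:

```r
# Keep the first plot by opening a second device before the next one
plot(1:10)           # draws on the current device
dev.new()            # opens an additional graphics device
plot(rnorm(10))      # draws on the new device
dev.set(dev.prev())  # make the earlier device current again
```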
You can specify fonts, colors, line styles, axes, reference lines, etc. by specifying graphical parameters. This allows a wide degree of customization. Graphical parameters, are covered in the Advanced Graphs section. The Advanced Graphs section also includes a more detailed coverage of axis and text customization.
You can create histograms with the function hist(x) where x is a numeric vector of values to be plotted. The option freq=FALSE plots probability densities instead of frequencies. The option breaks= controls the number of bins.
#### Simple Histogram
hist(mtcars$mpg)
hist(mtcars$mpg, breaks=12, col="red")
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
Histograms can be a poor method for determining the shape of a distribution because they are so strongly affected by the number of bins used.
par(mfrow=c(1,2))
hist(mtcars$mpg, breaks=3, col="red")
hist(mtcars$mpg, breaks=6, col="red")
Kernel density plots are usually a much more effective way to view the distribution of a variable. Create the plot using plot(density(x)), where x is a numeric vector.
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")
The sm.density.compare( ) function in the sm package allows you to superimpose the kernel density plots of two or more groups. The format is sm.density.compare(x, factor) where x is a numeric vector and factor is the grouping variable.
# Compare MPG distributions for cars with 4,6, or 8 cylinders
library(sm)
cyl.f <- factor(cyl, levels= c(4,6,8),
labels = c("4 cylinder", "6 cylinder", "8 cylinder"))
sm.density.compare(mpg, cyl.f, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")
Create dotplots with the dotchart(x, labels=) function, where x is a numeric vector and labels is a vector of labels for each point. You can add a groups= option to designate a factor specifying how the elements of x are grouped. If so, the option gcolor= controls the color of the groups and cex controls the size of the labels.
dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
main="Gas Mileage for Car Models",
xlab="Miles Per Gallon")
Sort by mpg, group and color by cylinder
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
Create barplots with the barplot(height) function, where height is a vector or matrix. If height is a vector, the values determine the heights of the bars in the plot. If height is a matrix and beside=FALSE, then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars. If height is a matrix and beside=TRUE, then the values in each column are juxtaposed rather than stacked. Include the option names.arg=(character vector) to label the bars. Use the option horiz=TRUE to create a horizontal barplot.
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution",
xlab="Number of Gears")
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
names.arg=c("3 Gears", "4 Gears", "5 Gears"))
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts))
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE)
Bar plots need not be based on counts or frequencies. You can create bar plots that represent means, medians, standard deviations, etc. Use the aggregate( ) function and pass the results to the barplot( ) function. By default, the categorical axis line is suppressed. Include the option axis.lty=1 to draw it. With many bars, bar labels may start to overlap. You can decrease the font size using the cex.names option. Values smaller than one will shrink the size of the labels. Additionally, you can use graphical parameters such as the following to improve text spacing:
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE, names.arg=c("3 Gears", "4 Gears", "5 Gears"), cex.names=0.8)
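As a sketch of the aggregate-then-barplot approach described above (the grouping variable is chosen for illustration):

```r
# Bars representing group means rather than counts
means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
barplot(means$mpg, names.arg = means$cyl, axis.lty = 1,
        xlab = "Number of Cylinders", ylab = "Mean Miles Per Gallon")
```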
Line charts are created with the function lines(x, y, type=) where x and y are numeric vectors of (x,y) points to connect. type= can take the following values:
| Type | Description |
|---|---|
| p | points |
| l | lines |
| o | overplotted points and lines |
| b, c | points (empty if “c”) joined by lines |
| s, S | stair steps |
| h | histogram-like vertical lines |
| n | does not produce any points or lines |
The lines( ) function adds information to a graph. It cannot produce a graph on its own. Usually it follows a plot(x, y) command that produces a graph.
By default, plot( ) plots the (x,y) points. Use the type=“n” option in the plot( ) command, to create the graph with axes, titles, etc., but without plotting the points.
Example: In the following code each of the type= options is applied to the same dataset. The plot( ) command sets up the graph, but does not plot the points.
x <- c(1:5); y <- x # create some data
par(pch=22, col="red") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
heading = paste("type=",opts[i])
plot(x, y, type="n", main=heading)
lines(x, y, type=opts[i])
}
Next, we demonstrate each of the type= options when plot( ) sets up the graph and does plot the points.
x <- c(1:5); y <- x # create some data
par(pch=22, col="blue") # plotting symbol and color
par(mfrow=c(2,4)) # all plots on one page
opts = c("p","l","o","b","c","s","S","h")
for(i in 1:length(opts)){
heading = paste("type=",opts[i])
plot(x, y, main=heading)
lines(x, y, type=opts[i])
}
As you can see, the type=“c” option only looks different from the type=“b” option if the plotting of points is suppressed in the plot( ) command.
To demonstrate the creation of a more complex line chart, let’s plot the growth of 5 orange trees over time. Each tree will have its own distinctive line. The data come from the dataset Orange.
# convert factor to numeric for convenience
Orange$Tree <- as.numeric(Orange$Tree)
ntrees <- max(Orange$Tree)
# get the range for the x and y axis
xrange <- range(Orange$age)
yrange <- range(Orange$circumference)
# set up the plot
plot(xrange, yrange, type="n", xlab="Age (days)",
ylab="Circumference (mm)" )
colors <- rainbow(ntrees)
linetype <- c(1:ntrees)
plotchar <- seq(18,18+ntrees,1)
# add lines
for (i in 1:ntrees) {
tree <- subset(Orange, Tree==i)
lines(tree$age, tree$circumference, type="b", lwd=1.5,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
Pie charts are not recommended in the R documentation, and their features are somewhat limited. The authors recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume. Pie charts are created with the function pie(x, labels=) where x is a non-negative numeric vector indicating the area of each slice and labels= notes a character vector of names for the slices.
slices <- c(10, 12,4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices,labels=lbls,explode=0.1,
main="Pie Chart of Countries ")
Note that you can play with the explode option!
Boxplots can be created for individual variables or for variables by group. The format is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data. An example of a formula is y~group, where a separate boxplot for numeric variable y is generated for each value of group. Add varwidth=TRUE to make boxplot widths proportional to the square root of the sample sizes. Add horizontal=TRUE to reverse the axis orientation.
# Boxplot of MPG by Car Cylinders
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon")
# Notched Boxplot of Tooth Growth Against 2 Crossed Factors
# boxes colored for ease of interpretation
boxplot(len~supp*dose, data=ToothGrowth, notch=TRUE,
col=(c("gold","darkgreen")),
main="Tooth Growth", xlab="Supplement and Dose")
A violin plot is a combination of a boxplot and a kernel density plot. They can be created using the vioplot( ) function from the vioplot package.
library(vioplot)
x1 <- mtcars$mpg[mtcars$cyl==4]
x2 <- mtcars$mpg[mtcars$cyl==6]
x3 <- mtcars$mpg[mtcars$cyl==8]
vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"),
col="gold")
title("Violin Plots of Miles Per Gallon")
There are many ways to create a scatterplot in R. The basic function is plot(x, y), where x and y are numeric vectors denoting the (x,y) points to plot.
attach(mtcars)
plot(wt, mpg, main="Scatterplot Example",
xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)
# Add fit lines
abline(lm(mpg~wt), col="red") # regression line (y~x)
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)
The scatterplot( ) function in the car package offers many enhanced features, including fit lines, marginal box plots, conditioning on a factor, and interactive point identification. Each of these features is optional.
# Enhanced Scatterplot of MPG vs. Weight
# by Number of Car Cylinders
library(car)
scatterplot(mpg ~ wt | cyl, data=mtcars,
xlab="Weight of Car", ylab="Miles Per Gallon",
main="Enhanced Scatter Plot",
labels=row.names(mtcars))
There are at least 4 useful functions for creating scatterplot matrices. Analysts must love scatterplot matrices!
pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
The car package can condition the scatterplot matrix on a factor, and optionally include lowess and linear best fit lines, and boxplot, densities, or histograms in the principal diagonal, as well as rug plots in the margins of the cells.
library(car)
scatterplotMatrix(~mpg+disp+drat+wt|cyl, data=mtcars,
main="Three Cylinder Options")
The gclus package provides options to rearrange the variables so that those with higher correlations are closer to the principal diagonal. It can also color code the cells to reflect the size of the correlations.
library(gclus)
dta <- mtcars[c(1,3,5,6)] # get data
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Variables Ordered and Colored by Correlation" )
When there are many data points and significant overlap, scatterplots become less useful. There are several approaches that can be used when this occurs. The hexbin(x, y) function in the hexbin package provides bivariate binning into hexagonal cells (it looks better than it sounds).
library(hexbin)
x <- rnorm(10000)
y <- rnorm(10000)
bin<-hexbin(x, y, xbins=50)
plot(bin, main="Hexagonal Binning")
You can create a 3D scatterplot with the scatterplot3d package. Use the function scatterplot3d(x, y, z).
library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, main="3D Scatterplot")
with Coloring and Vertical Drop Lines
scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
with Coloring and Vertical Lines and Regression Plane
s3d <-scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
type="h", main="3D Scatterplot")
fit <- lm(mpg ~ wt+disp)
s3d$plane3d(fit)
library(vegan)
data(dune)
We first study the lichen pasture data with only continuous constraints. The data sets are loaded with
data(varespec)
data(varechem)
The environmental data can be inspected using the command str, which shows the structure of any object in a compact form, and by asking for a summary of the data frame
str(varechem)
## 'data.frame': 24 obs. of 14 variables:
## $ N : num 19.8 13.4 20.2 20.6 23.8 22.8 26.6 24.2 29.8 28.1 ...
## $ P : num 42.1 39.1 67.7 60.8 54.5 40.9 36.7 31 73.5 40.5 ...
## $ K : num 140 167 207 234 181 ...
## $ Ca : num 519 357 973 834 777 ...
## $ Mg : num 90 70.7 209.1 127.2 125.8 ...
## $ S : num 32.3 35.2 58.1 40.7 39.5 40.8 33.8 27.1 42.5 60.2 ...
## $ Al : num 39 88.1 138 15.4 24.2 ...
## $ Fe : num 40.9 39 35.4 4.4 3 ...
## $ Mn : num 58.1 52.4 32.1 132 50.1 ...
## $ Zn : num 4.5 5.4 16.8 10.7 6.6 9.1 7.4 5.2 9.3 9.1 ...
## $ Mo : num 0.3 0.3 0.8 0.2 0.3 0.4 0.3 0.3 0.3 0.5 ...
## $ Baresoil: num 43.9 23.6 21.2 18.7 46 40.5 23 29.8 17.6 29.9 ...
## $ Humdepth: num 2.2 2.2 2 2.9 3 3.8 2.8 2 3 2.2 ...
## $ pH : num 2.7 2.8 3 2.8 2.7 2.7 2.8 2.8 2.8 2.8 ...
summary(varechem)
## N P K Ca
## Min. :13.4 Min. :22.7 Min. : 43.6 Min. : 188
## 1st Qu.:18.8 1st Qu.:32.6 1st Qu.:127.2 1st Qu.: 426
## Median :22.1 Median :41.5 Median :166.6 Median : 518
## Mean :22.4 Mean :45.1 Mean :162.9 Mean : 570
## 3rd Qu.:26.3 3rd Qu.:57.0 3rd Qu.:205.8 3rd Qu.: 739
## Max. :33.1 Max. :73.5 Max. :313.8 Max. :1170
## Mg S Al Fe
## Min. : 25.7 Min. :14.9 Min. : 12 Min. : 2.3
## 1st Qu.: 60.9 1st Qu.:29.4 1st Qu.: 38 1st Qu.: 5.6
## Median : 75.0 Median :36.2 Median :107 Median : 27.8
## Mean : 87.5 Mean :37.2 Mean :142 Mean : 49.6
## 3rd Qu.:108.7 3rd Qu.:43.6 3rd Qu.:234 3rd Qu.: 85.2
## Max. :209.1 Max. :60.2 Max. :435 Max. :204.4
## Mn Zn Mo Baresoil
## Min. : 10.1 Min. : 2.60 Min. :0.050 Min. : 0.0
## 1st Qu.: 26.7 1st Qu.: 5.38 1st Qu.:0.275 1st Qu.: 8.0
## Median : 36.5 Median : 8.10 Median :0.300 Median :21.2
## Mean : 49.3 Mean : 7.60 Mean :0.396 Mean :22.9
## 3rd Qu.: 59.0 3rd Qu.: 9.10 3rd Qu.:0.500 3rd Qu.:30.8
## Max. :132.0 Max. :16.80 Max. :1.100 Max. :56.9
## Humdepth pH
## Min. :1.00 Min. :2.70
## 1st Qu.:1.80 1st Qu.:2.80
## Median :2.20 Median :2.90
## Mean :2.20 Mean :2.93
## 3rd Qu.:2.62 3rd Qu.:3.00
## Max. :3.80 Max. :3.60
There is a difference between a data frame and a matrix in R: a data frame is actually a list of variables which can be of different types (continuous, factors, ordered factors), whereas a matrix only contains values of a single type. The dependencies among variables can be inspected visually using the plot command for data frames; under the hood this calls the pairs function
plot(varechem, gap=0, panel=panel.smooth)
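The data-frame/matrix distinction mentioned above can be seen directly (a minimal illustration, not from the source):

```r
df <- data.frame(x = 1:3, f = c("a", "b", "a"))  # columns of different types
m  <- as.matrix(df)   # coercing to a matrix forces one common type
class(df$x)           # the integer column survives in the data frame
class(m[, "x"])       # in the matrix the numbers became character strings
```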
It is always useful to have a look at the data before rushing into analysis. This is essential when you read in your own data: you really must check that the data were imported correctly.
You can refer to rows and columns in the data frame in various ways. One way is to use numeric indices within square brackets [ ]. The first index refers to a row, the second to a column:
varechem[3,7]
## [1] 138
Several items can be referred to by combining indices within c() or by using a regular sequence with : between the first and last items:
varechem[2, c(3,1, 7)]
varechem[3:5, 7]
## [1] 138.0 15.4 24.2
If you omit one index, the whole row or whole column will be returned:
varechem[2,]
varechem[,7]
## [1] 39.0 88.1 138.0 15.4 24.2 104.8 20.7 74.2 17.9 329.7 92.3
## [12] 124.3 12.1 294.9 39.0 155.1 304.6 435.1 316.5 227.1 108.8 168.2
## [23] 253.6 35.8
Finally, you can also refer to rows and columns by their names instead of numeric indices, and in data frames you can also use the dollar sign for variables:
varechem[, "pH"]
## [1] 2.7 2.8 3.0 2.8 2.7 2.7 2.8 2.8 2.8 2.8 2.7 2.9 2.9 3.1 3.1 3.0 3.3
## [18] 2.9 2.9 3.0 2.9 3.2 3.6 3.0
varechem$pH
## [1] 2.7 2.8 3.0 2.8 2.7 2.7 2.8 2.8 2.8 2.8 2.7 2.9 2.9 3.1 3.1 3.0 3.3
## [18] 2.9 2.9 3.0 2.9 3.2 3.6 3.0
Function metaMDS is in the vegan package. It is a wrapper that internally calls a function to run the actual NMDS; the default NMDS engine is function monoMDS, also in vegan.
Running NMDS takes only one command:
m <- metaMDS(dune)
## Run 0 stress 0.119
## Run 1 stress 0.119
## ... Procrustes: rmse 0.000317 max resid 0.000976
## ... Similar to previous best
## Run 2 stress 0.119
## ... Procrustes: rmse 9.7e-05 max resid 0.000297
## ... Similar to previous best
## Run 3 stress 0.118
## ... New best solution
## ... Procrustes: rmse 0.0203 max resid 0.0649
## Run 4 stress 0.118
## ... Procrustes: rmse 1.5e-05 max resid 4.77e-05
## ... Similar to previous best
## Run 5 stress 0.181
## Run 6 stress 0.118
## ... New best solution
## ... Procrustes: rmse 7.2e-06 max resid 1.76e-05
## ... Similar to previous best
## Run 7 stress 0.181
## Run 8 stress 0.118
## ... New best solution
## ... Procrustes: rmse 5.09e-06 max resid 1.62e-05
## ... Similar to previous best
## Run 9 stress 0.118
## ... Procrustes: rmse 2.19e-05 max resid 6.7e-05
## ... Similar to previous best
## Run 10 stress 0.204
## Run 11 stress 0.208
## Run 12 stress 0.119
## Run 13 stress 0.118
## ... Procrustes: rmse 1.04e-05 max resid 3.52e-05
## ... Similar to previous best
## Run 14 stress 0.119
## Run 15 stress 0.118
## ... Procrustes: rmse 3.26e-06 max resid 8.31e-06
## ... Similar to previous best
## Run 16 stress 0.119
## Run 17 stress 0.181
## Run 18 stress 0.208
## Run 19 stress 0.181
## Run 20 stress 0.119
## *** Solution reached
Sometimes metaMDS does not find the same best solution twice, and in that case you cannot be sure that the best solution found really is the global optimum. If you are uncertain, you can continue the iterations from your current solution by giving its name in argument previous.best
m <- metaMDS(dune, previous=m)
## Starting from 2-dimensional configuration
## Run 0 stress 0.118
## Run 1 stress 0.118
## ... Procrustes: rmse 1.21e-05 max resid 3.54e-05
## ... Similar to previous best
## Run 2 stress 0.118
## ... Procrustes: rmse 1.52e-05 max resid 4.02e-05
## ... Similar to previous best
## Run 3 stress 0.181
## Run 4 stress 0.118
## ... Procrustes: rmse 1.69e-05 max resid 4.28e-05
## ... Similar to previous best
## Run 5 stress 0.119
## Run 6 stress 0.118
## ... Procrustes: rmse 2.59e-05 max resid 8.14e-05
## ... Similar to previous best
## Run 7 stress 0.119
## Run 8 stress 0.119
## Run 9 stress 0.119
## Run 10 stress 0.119
## Run 11 stress 0.119
## Run 12 stress 0.118
## ... Procrustes: rmse 4.74e-05 max resid 0.000155
## ... Similar to previous best
## Run 13 stress 0.119
## Run 14 stress 0.119
## Run 15 stress 0.119
## Run 16 stress 0.118
## ... Procrustes: rmse 8.48e-06 max resid 2.51e-05
## ... Similar to previous best
## Run 17 stress 0.119
## Run 18 stress 0.118
## ... Procrustes: rmse 1.03e-05 max resid 2.57e-05
## ... Similar to previous best
## Run 19 stress 0.118
## ... Procrustes: rmse 2.42e-05 max resid 7.7e-05
## ... Similar to previous best
## Run 20 stress 0.118
## ... Procrustes: rmse 1.07e-05 max resid 2.85e-05
## ... Similar to previous best
## *** Solution reached
We saved the results of the analysis in an object called m. You can use any decent name. Decent names do not contain spaces, do not begin with a number, and do not contain special characters that can be interpreted as mathematical operators, such as +-/:*$&!. If you do not save the results, they will only be printed on the screen and will be lost once they scroll out of sight. The symbol <- saves the results, but you can also use the more familiar =. If you type the name of your result object, the object will be printed on the screen
m
##
## Call:
## metaMDS(comm = dune, previous.best = m)
##
## global Multidimensional Scaling using monoMDS
##
## Data: dune
## Distance: bray
##
## Dimensions: 2
## Stress: 0.118
## Stress type 1, weak ties
## Two convergent solutions found after 40 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'dune'
However, this is only a printed summary of the object; the result object contains much more detailed information. For instance, the ordination scores are not displayed, although they are saved as well. You can see this by plotting the object
plot(m)
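The stored scores can also be extracted directly with vegan's scores function; for instance:

```r
head(scores(m, display = "sites"))    # NMDS coordinates of the sample plots
head(scores(m, display = "species"))  # coordinates of the species
```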
The default plot command uses black circles for sample plots and red crosses for species. You can see the names of both by giving argument type="t" (text)
plot(m, type="t")
The metaMDS function decided many things for you. For instance, you did not need to define or calculate the dissimilarity index: the function automatically used Bray-Curtis, which is a popular choice. However, you can use any of the indices defined in function vegdist, or you can even give the name of some other function that calculates the dissimilarities. For instance, you can use Euclidean distances, which are commonly regarded as a poor choice for community data
m2 <- metaMDS(dune, dist="euclid")
## Run 0 stress 0.117
## Run 1 stress 0.117
## ... Procrustes: rmse 1.51e-06 max resid 3.31e-06
## ... Similar to previous best
## Run 2 stress 0.117
## ... New best solution
## ... Procrustes: rmse 1.24e-06 max resid 2.68e-06
## ... Similar to previous best
## Run 3 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 4 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 5 stress 0.117
## ... Procrustes: rmse 4.92e-07 max resid 9.12e-07
## ... Similar to previous best
## Run 6 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 7 stress 0.117
## ... Procrustes: rmse 3.43e-06 max resid 7.18e-06
## ... Similar to previous best
## Run 8 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 9 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 10 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 11 stress 0.117
## ... Procrustes: rmse 1.6e-06 max resid 4.36e-06
## ... Similar to previous best
## Run 12 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0552
## Run 13 stress 0.117
## ... Procrustes: rmse 8.73e-07 max resid 2.41e-06
## ... Similar to previous best
## Run 14 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 15 stress 0.117
## ... Procrustes: rmse 1.03e-06 max resid 2.78e-06
## ... Similar to previous best
## Run 16 stress 0.117
## ... Procrustes: rmse 4.1e-06 max resid 1.25e-05
## ... Similar to previous best
## Run 17 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 18 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 19 stress 0.118
## ... Procrustes: rmse 0.0171 max resid 0.0553
## Run 20 stress 0.117
## ... Procrustes: rmse 8.96e-07 max resid 2.38e-06
## ... Similar to previous best
## *** Solution reached
m2
##
## Call:
## metaMDS(comm = dune, distance = "euclid")
##
## global Multidimensional Scaling using monoMDS
##
## Data: dune
## Distance: euclidean
##
## Dimensions: 2
## Stress: 0.117
## Stress type 1, weak ties
## Two convergent solutions found after 20 tries
## Scaling: centring, PC rotation
## Species: expanded scores based on 'dune'
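Instead of letting metaMDS calculate the index, you can also supply a precomputed dissimilarity matrix from vegdist; a sketch (the choice of the Jaccard index here is only an illustration):

```r
library(vegan)
d  <- vegdist(dune, method = "jaccard")  # precomputed dissimilarities
m3 <- metaMDS(d)  # with a plain dist object no species scores are available
```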
You can compare the shapes of the nonlinear regressions of these two solutions using function stressplot
stressplot(m)
stressplot(m2)
How do the stress plots differ from each other? You can directly compare the similarity of the results using Procrustes rotation (do so, and describe the differences)
plot(procrustes(m, m2))
PCA can be run with command rda, or you can use the standard R commands prcomp or princomp. We shall later use rda for constrained ordination, and therefore we use it for PCA as well
ord <- rda(dune)
ord
## Call: rda(X = dune)
##
## Inertia Rank
## Total 84.1
## Unconstrained 84.1 19
## Inertia is variance
##
## Eigenvalues for unconstrained axes:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 24.80 18.15 7.63 7.15 5.70 4.33 3.20 2.78
## (Showed only 8 of all 19 unconstrained eigenvalues)
The results can be plotted with many alternative scaling systems for sites and species. The scalings are described in help(scores.cca) and discussed in depth in the vignette on Design Decisions, which can be accessed with vegan command vegandocs(). You can inspect the visual effects by scrolling through the alternatives
plot(ord)
plot(ord, scal=1)
plot(ord, scal=3)
Study these alternative plot scalings: see ?scores.cca for explanation.
?scores.cca
In addition to the standard plot function, you can also use the biplot function, which uses arrows instead of points for species.
biplot(ord, scal=2)
To give equal weights to all species, you should specify a new argument in the call
sord <- rda(dune, scal=TRUE)
sord
## Call: rda(X = dune, scale = TRUE)
##
## Inertia Rank
## Total 30
## Unconstrained 30 19
## Inertia is correlations
##
## Eigenvalues for unconstrained axes:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 7.03 5.00 3.55 2.64 2.14 1.76 1.48 1.32
## (Showed only 8 of all 19 unconstrained eigenvalues)
plot(ord)
plot(sord)
biplot(sord)
How do the results differ from the unscaled PCA?
Eigenvector ordinations implement a linear mapping of Euclidean distances onto the ordination (check the lectures). Nonmetric Multidimensional Scaling (NMDS) uses a nonlinear mapping of any distance or dissimilarity measure. You can study the effect of both the nonlinear mapping and the non-Euclidean distances using Procrustes rotation
plot(procrustes(m, ord))
plot(procrustes(m2, ord))
Correspondence Analysis (CA) is similar to PCA, but uses command cca instead of rda
mca = cca(dune)
mca
## Call: cca(X = dune)
##
## Inertia Rank
## Total 2.12
## Unconstrained 2.12 19
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.536 0.400 0.260 0.176 0.145 0.108 0.092 0.081
## (Showed only 8 of all 19 unconstrained eigenvalues)
Compare the output of cca to that of rda and find out how they differ (the lecture slides may be useful). Plotting works similarly as in PCA, and again there are several scaling alternatives; their description can be found with help(scores.cca), and they can be inspected visually
plot(mca)
plot(mca, scal=1)
plot(mca, scal=2)
plot(mca, scal=3)
We have already used the standard plot function for ordination results. This is a quick and dirty solution that gives a first impression of the results. For clean or publishable graphics we may need closer control, and therefore we need to understand the inner workings of R plots. The basic plot command does not only draw a graph: most importantly, it sets up the plotting area and the plotting scales. After plot you can add several new elements to the created plot. For instance, you can add text in the margins, and for that purpose the plot command reserves empty space there
plot(mca)
title(main = "Correspondence Analysis")
There is empty space in the margins in case you want to add comments there. Often you do not want anything there, and you can set narrower margins to get a larger plotting area. There are many other graphical parameters that can be set; many of them are described in ?par. The margins are set by a parameter called mar, which gives the four margin widths in lines (rows of text) in the order bottom, left, top, right. The following sets somewhat narrower margins than the default
par(mar=c(4,4,1,1)+.1)
plot(mca)
You have only limited control over the basic plot. For instance, you can select either type="t" (text) or type="p" (points), but you cannot use points for sample plots and text for species, nor can you select colours or symbol sizes separately. However, you can first draw an empty plot (type="n", none), and then use commands points or text that add items to an existing plot
plot(mca, type="n")
points(mca, display = "sites", col="blue", pch=16)
text(mca, col="red", dis="sp")
Both functions are configurable, as you see in ?text and ?points. Plotting characters (pch) can be given either as numbers (described in the help page) or as symbols (such as pch = "+"). For a quick survey of the choices you can use commands example(text) and example(points)
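For a compact overview of the numeric plotting characters, base R suffices:

```r
# Draw the 26 numeric plotting characters with their codes above them
plot(0:25, rep(1, 26), pch = 0:25, ylim = c(0.5, 1.5),
     axes = FALSE, xlab = "", ylab = "")
text(0:25, rep(1.25, 26), labels = 0:25, cex = 0.7)
```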
Ordination plots are often crowded and messy: names are written over each other and may be difficult to read. For publications you need cleaner plots. One alternative is to use opaque labels for the text with function ordilabel. These will still cover each other, but at least the uppermost labels will be readable. With argument priority you can select which labels stay on top. The following draws an empty plot, adds sample plots as points, and then adds species names on opaque labels, giving higher priority to more abundant species (high column sums)
plot(mca, type = "n")
points(mca, display="sites")
abu <- colSums(dune)
ordilabel(mca, col="red", dis="sp", fill="peachpuff", priority=abu)
Another alternative is function orditorp, which uses text only where it does not cover previously added names, and points otherwise. This function also accepts the priority argument
plot(mca, type = "n")
points(mca, display="sites", pch=23, col="red", bg="yellow")
orditorp(mca, dis = "sp", prio=abu, pch="+", pcol="gray")
Finally, there is function ordipointlabel, which uses both points and text to label the items. The function tries to locate the text so that overlap is minimized. This is a slow numerical process that usually reaches a slightly different result each time, but it can sometimes give decent results automatically (some data sets have so many species or observations that it is impossible to label everything neatly).
ordipointlabel(mca)
As a second example we examine a larger data set in which a number of Dutch ditches were poisoned with the insecticide Pyrifos, and the impact on and recovery of the animal plankton were followed. The sampling design is regular, the environmental data are automatically generated with command example, and the data set is described more thoroughly in help(pyrifos)
data(pyrifos)
example(pyrifos)
##
## pyrifs> data(pyrifos)
##
## pyrifs> ditch <- gl(12, 1, length=132)
##
## pyrifs> week <- gl(11, 12, labels=c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24))
##
## pyrifs> dose <- factor(rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11))
We use Detrended Correspondence Analysis (DCA) to demonstrate that method as well
ditch
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11
## [24] 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10
## [47] 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9
## [70] 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8
## [93] 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7
## [116] 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12
dose
## [1] 0.1 0 0 0.9 0 44 6 0.1 44 0.9 0 6 0.1 0 0 0.9 0
## [18] 44 6 0.1 44 0.9 0 6 0.1 0 0 0.9 0 44 6 0.1 44 0.9
## [35] 0 6 0.1 0 0 0.9 0 44 6 0.1 44 0.9 0 6 0.1 0 0
## [52] 0.9 0 44 6 0.1 44 0.9 0 6 0.1 0 0 0.9 0 44 6 0.1
## [69] 44 0.9 0 6 0.1 0 0 0.9 0 44 6 0.1 44 0.9 0 6 0.1
## [86] 0 0 0.9 0 44 6 0.1 44 0.9 0 6 0.1 0 0 0.9 0 44
## [103] 6 0.1 44 0.9 0 6 0.1 0 0 0.9 0 44 6 0.1 44 0.9 0
## [120] 6 0.1 0 0 0.9 0 44 6 0.1 44 0.9 0 6
## Levels: 0 0.1 0.9 6 44
dca <- decorana(pyrifos)
dca
##
## Call:
## decorana(veg = pyrifos)
##
## Detrended correspondence analysis with 26 segments.
## Rescaling of axes with 4 iterations.
##
## DCA1 DCA2 DCA3 DCA4
## Eigenvalues 0.135 0.0884 0.0777 0.0629
## Decorana values 0.140 0.0933 0.0754 0.0651
## Axis lengths 1.516 1.7717 1.3735 1.5401
Compare the display of this analysis to other methods. The following gives ordinary CA for comparison
cap <- cca(pyrifos)
cap
## Call: cca(X = pyrifos)
##
## Inertia Rank
## Total 2.87
## Unconstrained 2.87 131
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.1399 0.0981 0.0903 0.0834 0.0772 0.0733 0.0695 0.0655
## (Showed only 8 of all 131 unconstrained eigenvalues)
plot(procrustes(cap, dca))
## Warning in procrustes(cap, dca): X has fewer axes than Y: X adjusted to comform Y
plot(procrustes(dca, cap, choices=1:2))
Spinning is good for private use since it helps in understanding 3D structures, but it is difficult to use in papers. An alternative is to use Lattice (Trellis) graphics to produce panels of plots or other special displays. The following produces a separate panel for each Pyrifos dose, displays each ditch with a different colour, and connects the points within ditches by lines showing the temporal succession.
ordixyplot(dca, form = DCA2 ~ DCA1 | dose, group=ditch, type="b")
The formula says to draw axis DCA2 against DCA1, making a panel for each dose, and to use lines and points (type = "b", for "both") for each ditch. Please note that the sign before dose is not a slash but a vertical bar (|). Explain with this graph how the natural annual succession and the impact of Pyrifos can be seen.
The basic command for fitting vectors or factor levels of environmental variables is envfit. The following example uses the two alternative NMDS results. Which environmental variables are most important? Which of the two ordinations seems better related to the environment?
data(dune.env)
envfit(m, dune.env)
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## A1 0.965 0.263 0.36 0.019 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
##
## ***FACTORS:
##
## Centroids:
## NMDS1 NMDS2
## Moisture1 -0.51 -0.04
## Moisture2 -0.39 0.01
## Moisture4 0.28 -0.40
## Moisture5 0.66 0.15
## ManagementBF -0.45 -0.01
## ManagementHF -0.26 -0.13
## ManagementNM 0.30 0.58
## ManagementSF 0.15 -0.47
## UseHayfield -0.16 0.32
## UseHaypastu -0.04 -0.34
## UsePasture 0.29 0.08
## Manure0 0.30 0.58
## Manure1 -0.25 -0.02
## Manure2 -0.31 -0.19
## Manure3 0.31 -0.25
## Manure4 -0.35 -0.56
##
## Goodness of fit:
## r2 Pr(>r)
## Moisture 0.50 0.001 ***
## Management 0.41 0.009 **
## Use 0.19 0.122
## Manure 0.42 0.019 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
ef <- envfit(m, dune.env, perm=1000)
ef
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## A1 0.965 0.263 0.36 0.016 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
##
## ***FACTORS:
##
## Centroids:
## NMDS1 NMDS2
## Moisture1 -0.51 -0.04
## Moisture2 -0.39 0.01
## Moisture4 0.28 -0.40
## Moisture5 0.66 0.15
## ManagementBF -0.45 -0.01
## ManagementHF -0.26 -0.13
## ManagementNM 0.30 0.58
## ManagementSF 0.15 -0.47
## UseHayfield -0.16 0.32
## UseHaypastu -0.04 -0.34
## UsePasture 0.29 0.08
## Manure0 0.30 0.58
## Manure1 -0.25 -0.02
## Manure2 -0.31 -0.19
## Manure3 0.31 -0.25
## Manure4 -0.35 -0.56
##
## Goodness of fit:
## r2 Pr(>r)
## Moisture 0.50 0.002 **
## Management 0.41 0.004 **
## Use 0.19 0.138
## Manure 0.42 0.019 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
ef2 <- envfit(m2, dune.env, perm=1000)
ef2
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## A1 0.975 0.221 0.2 0.13
## Permutation: free
## Number of permutations: 1000
##
## ***FACTORS:
##
## Centroids:
## NMDS1 NMDS2
## Moisture1 -4.52 0.75
## Moisture2 -1.82 -1.72
## Moisture4 2.88 -4.24
## Moisture5 4.73 1.44
## ManagementBF -4.83 0.05
## ManagementHF -3.25 -0.52
## ManagementNM 2.85 4.34
## ManagementSF 2.27 -3.94
## UseHayfield -0.65 2.41
## UseHaypastu -0.60 -2.77
## UsePasture 1.88 1.07
## Manure0 2.85 4.34
## Manure1 -2.96 0.02
## Manure2 -3.66 -1.30
## Manure3 2.66 -1.87
## Manure4 -1.41 -4.48
##
## Goodness of fit:
## r2 Pr(>r)
## Moisture 0.51 0.001 ***
## Management 0.52 0.001 ***
## Use 0.17 0.173
## Manure 0.46 0.018 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
We can add the fitted variables to the plot as vectors and as centroids of factor levels. The vectors are automatically scaled to the graph area. The following shows only the variables that were regarded as statistically significant at level P ≤ 0.05:
plot(m)
plot(ef, add=T, p.=0.05)
plot(m2)
plot(ef2, add=T, p.=0.05)
If you only have continuous variables, you can constrain the ordination by adding a second argument to the call of cca or rda
m <- cca(varespec)
mm <- cca(varespec, varechem)
m
## Call: cca(X = varespec)
##
## Inertia Rank
## Total 2.08
## Unconstrained 2.08 23
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.525 0.357 0.234 0.195 0.178 0.122 0.115 0.089
## (Showed only 8 of all 23 unconstrained eigenvalues)
mm
## Call: cca(X = varespec, Y = varechem)
##
## Inertia Proportion Rank
## Total 2.083 1.000
## Constrained 1.441 0.692 14
## Unconstrained 0.642 0.308 9
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8 CCA9 CCA10 CCA11 CCA12
## 0.439 0.292 0.163 0.142 0.118 0.089 0.070 0.058 0.031 0.013 0.008 0.007
## CCA13 CCA14
## 0.006 0.005
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9
## 0.1978 0.1419 0.1012 0.0708 0.0533 0.0333 0.0189 0.0151 0.0095
Compare the output of these two ordinations. In particular, see how the inertia and rank (number of axes) are decomposed in constrained ordination. With this calling convention, the only way to select environmental variables from the full data set is to use a subset of the matrix:
cca(varespec, varechem[, c("Al", "P", "K")])
## Call: cca(X = varespec, Y = varechem[, c("Al", "P", "K")])
##
## Inertia Proportion Rank
## Total 2.083 1.000
## Constrained 0.644 0.309 3
## Unconstrained 1.439 0.691 20
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3
## 0.362 0.170 0.113
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.350 0.220 0.185 0.155 0.135 0.100 0.077 0.054
## (Showed only 8 of all 20 unconstrained eigenvalues)
For a full numerical survey you can use summary. The summary lists scores for every species and every site, and for sites two different kinds of scores. The output is long, but if you only want an idea of the style of the results, you can delimit the output to the head (or tail) of each type of scores:
head(summary(mm))
##
## Call:
## cca(X = varespec, Y = varechem)
##
## Partitioning of mean squared contingency coefficient:
## Inertia Proportion
## Total 2.083 1.000
## Constrained 1.441 0.692
## Unconstrained 0.642 0.308
##
## Eigenvalues, and their contribution to the mean squared contingency coefficient
##
## Importance of components:
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7
## Eigenvalue 0.439 0.292 0.1628 0.1421 0.1180 0.0890 0.0703
## Proportion Explained 0.211 0.140 0.0782 0.0682 0.0566 0.0427 0.0337
## Cumulative Proportion 0.211 0.351 0.4289 0.4971 0.5537 0.5965 0.6302
## CCA8 CCA9 CCA10 CCA11 CCA12 CCA13
## Eigenvalue 0.0584 0.0311 0.01329 0.00836 0.00654 0.00616
## Proportion Explained 0.0280 0.0149 0.00638 0.00402 0.00314 0.00296
## Cumulative Proportion 0.6583 0.6732 0.67958 0.68359 0.68673 0.68969
## CCA14 CA1 CA2 CA3 CA4 CA5 CA6
## Eigenvalue 0.00473 0.1978 0.1419 0.1012 0.0708 0.0533 0.0333
## Proportion Explained 0.00227 0.0949 0.0681 0.0486 0.0340 0.0256 0.0160
## Cumulative Proportion 0.69196 0.7869 0.8550 0.9036 0.9376 0.9631 0.9791
## CA7 CA8 CA9
## Eigenvalue 0.01887 0.01510 0.00949
## Proportion Explained 0.00906 0.00725 0.00455
## Cumulative Proportion 0.98820 0.99545 1.00000
##
## Accumulated constrained eigenvalues
## Importance of components:
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8
## Eigenvalue 0.439 0.292 0.163 0.1421 0.1180 0.0890 0.0703 0.0584
## Proportion Explained 0.304 0.202 0.113 0.0986 0.0818 0.0618 0.0488 0.0405
## Cumulative Proportion 0.304 0.507 0.620 0.7184 0.8003 0.8620 0.9108 0.9513
## CCA9 CCA10 CCA11 CCA12 CCA13 CCA14
## Eigenvalue 0.0311 0.01329 0.00836 0.00654 0.00616 0.00473
## Proportion Explained 0.0216 0.00922 0.00580 0.00454 0.00427 0.00328
## Cumulative Proportion 0.9729 0.98211 0.98791 0.99245 0.99672 1.00000
##
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
##
##
## Species scores
##
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
## Callvulg 0.0753 -0.9358 1.6777 0.6955 1.0775 -0.34500
## Empenigr -0.1813 0.0761 0.0365 -0.4277 -0.1382 0.01052
## Rhodtome -1.0535 -0.0603 0.0774 -0.9389 -0.2139 -0.51803
## Vaccmyrt -1.2774 0.3076 0.3037 -0.0921 -0.5688 -0.61302
## Vaccviti -0.1526 0.1205 -0.0530 -0.3623 0.0839 0.00894
## Pinusylv 0.2430 0.2643 0.2233 -0.2738 0.2921 -0.06334
## ....
##
##
## Site scores (weighted averages of species scores)
##
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
## 18 0.178 -1.060 -0.4088 -0.6072 -0.565 0.2417
## 15 -0.970 -0.197 0.4210 0.3032 0.152 0.8039
## 24 -1.280 0.476 -2.9469 0.3929 3.954 0.7659
## 27 -1.501 0.652 0.0858 0.7621 -1.233 -0.0976
## 23 -0.598 -0.184 -0.1356 -1.1642 -0.302 0.0703
## 19 -0.110 0.714 0.0166 -0.0777 -0.552 -0.0826
## ....
##
##
## Site constraints (linear combinations of constraining variables)
##
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
## 18 -0.423 -1.325 -0.4921 -0.945 -0.0485 0.940
## 15 -0.190 0.497 0.4545 -0.530 -0.0766 -0.790
## 24 -0.863 0.252 -2.7604 0.570 3.2927 0.263
## 27 -1.698 0.487 -0.5635 1.074 -0.6141 0.499
## 23 -0.796 0.107 0.2575 -0.904 -0.2876 0.439
## 19 -0.677 1.001 0.0334 -1.004 -0.1413 -0.938
## ....
##
##
## Biplot scores for constraining variables
##
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
## N -0.2229 -0.5289 0.00673 0.1773 -0.25322 0.10201
## P -0.3187 0.5789 -0.16200 0.4795 0.18410 -0.12183
## K -0.3661 0.3079 0.35982 0.4794 0.32544 -0.19664
## Ca -0.4476 0.4218 -0.03776 0.0983 0.30797 0.04355
## Mg -0.4350 0.3405 -0.14217 0.1079 0.49784 -0.00576
## S -0.0241 0.4157 0.14838 0.4445 0.59706 -0.16630
## Al 0.7698 -0.0475 0.03761 0.3910 0.16090 -0.33655
## Fe 0.6491 -0.0881 -0.04207 0.2630 -0.06981 -0.11134
## Mn -0.7223 0.2246 0.11305 0.2915 -0.13868 0.18047
## Zn -0.3581 0.3349 -0.27792 0.3457 0.61919 -0.00119
## Mo 0.2041 -0.1033 -0.15701 0.3242 0.51644 -0.31352
## Baresoil -0.5367 -0.2548 0.13691 -0.5206 0.16662 -0.35241
## Humdepth -0.6967 0.2016 0.27163 -0.1357 -0.00325 -0.05135
## pH 0.4972 0.0751 -0.32634 0.0209 -0.14557 -0.05909
The summary also shows how the total inertia is divided among the individual axes; the lines for accumulated inertia show the cumulative variation explained by all axes, as well as the accumulated constrained inertia. The two kinds of site scores are often called WA scores and LC scores. The WA scores are derived from the species scores, and the LC scores are derived from the constraints as their linear combinations. The analysis tries to keep these two sets as similar as possible. Their similarity can be inspected with the species-environment correlation, which is simply the correlation between the LC and WA scores for an axis.
spenvcor(mm)
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8 CCA9 CCA10 CCA11 CCA12
## 0.927 0.929 0.929 0.927 0.815 0.775 0.900 0.794 0.639 0.755 0.402 0.570
## CCA13 CCA14
## 0.592 0.588
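For a single axis this correlation can be computed by hand from the two sets of site scores; a sketch (compare the value with spenvcor(mm) for the first axis):

```r
wa <- scores(mm, display = "wa", choices = 1)  # weighted-average site scores
lc <- scores(mm, display = "lc", choices = 1)  # linear-combination site scores
cor(wa, lc)
```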
The default plot displays species scores, WA scores and environmental variables.
plot(mm)
You can change this by listing the displayed elements in argument display. The following shows only the LC scores and the environmental variables:
plot(mm, display = c("lc","bp"))
The items are called "sp", "wa", "lc", "bp" and "cn", which are self-explanatory except the last, which refers to the centroids of the factor constraints. You can visually inspect the difference between the WA and LC scores (i.e. the species-environment relationship) with command ordispider, which (among other alternatives) joins the WA and LC scores by a line:
plot(mm, dis=c("wa","lc"))
ordispider(mm)
It is often a bad idea to use constrained ordination with a very large number of constraints: the constraints will probably not constrain much at all, and your analysis may be very similar to the ordinary unconstrained analysis that could have been used just as well. You can inspect this with Procrustes rotation:
plot(procrustes(m, mm))
You can also see how similar the ordinations are by fitting environmental variables to unconstrained ordination.
plot(m)
plot(envfit(m, varechem))
You can use the formula interface to select the variables used as constraints:
cca(varespec ~ Al + P + K, data=varechem)
## Call: cca(formula = varespec ~ Al + P + K, data = varechem)
##
## Inertia Proportion Rank
## Total 2.083 1.000
## Constrained 0.644 0.309 3
## Unconstrained 1.439 0.691 20
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3
## 0.362 0.170 0.113
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.350 0.220 0.185 0.155 0.135 0.100 0.077 0.054
## (Showed only 8 of all 20 unconstrained eigenvalues)
This is indeed the recommended way of defining the model, because it gives full control of the model, and some further analyses are possible only when the model is defined through a formula. With the formula interface you can also use factor constraints:
data(dune)
data(dune.env)
str(dune.env)
## 'data.frame': 20 obs. of 5 variables:
## $ A1 : num 2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
## $ Moisture : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
## $ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
## $ Use : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
## $ Manure : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
summary(dune.env)
## A1 Moisture Management Use Manure
## Min. : 2.80 1:7 BF:3 Hayfield:7 0:6
## 1st Qu.: 3.50 2:4 HF:5 Haypastu:8 1:3
## Median : 4.20 4:2 NM:6 Pasture :5 2:4
## Mean : 4.85 5:7 SF:6 3:4
## 3rd Qu.: 5.72 4:3
## Max. :11.50
mdun <- cca(dune ~ Management + A1, dune.env)
mdun
## Call: cca(formula = dune ~ Management + A1, data = dune.env)
##
## Inertia Proportion Rank
## Total 2.115 1.000
## Constrained 0.780 0.369 4
## Unconstrained 1.335 0.631 15
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3 CCA4
## 0.319 0.237 0.132 0.092
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9 CA10 CA11 CA12
## 0.362 0.203 0.153 0.135 0.111 0.080 0.077 0.055 0.044 0.042 0.032 0.018
## CA13 CA14 CA15
## 0.012 0.009 0.005
plot(mdun)
Note here how the use of factors increases the rank of the constrained solution. If you have p levels of a factor, you will get p − 1 axes, provided your factor levels really are independent.
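You can check the rank arithmetic for mdun above: Management has 4 levels, contributing 4 − 1 = 3 constrained axes, and the continuous A1 contributes 1 more, giving the constrained rank of 4 shown in the output. A sketch (note that $CCA$rank peeks at an internal field of the result object):

```r
nlevels(dune.env$Management)  # 4 levels, hence 3 constrained axes
mdun$CCA$rank                 # 4 = (4 - 1) + 1
```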
With a formula we have full control of the model, but we also face the problem of model choice. Models must be built carefully, and preferably used to test specific hypotheses. Sometimes we may want automatic model building, but this must be done with care: there are some shortcuts and tricks for this in vegan, but they should be used with utmost caution. In automatic model building we usually need two extreme models: the smallest and the largest model considered. The following shortcuts build a model with all environmental variables and a model with no environmental variables, both defined with a formula so that terms can be added to or removed from the model
m1 <- cca(varespec ~ ., varechem)
m0 <- cca(varespec ~ 1, varechem)
m1
## Call: cca(formula = varespec ~ N + P + K + Ca + Mg + S + Al + Fe +
## Mn + Zn + Mo + Baresoil + Humdepth + pH, data = varechem)
##
## Inertia Proportion Rank
## Total 2.083 1.000
## Constrained 1.441 0.692 14
## Unconstrained 0.642 0.308 9
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8 CCA9 CCA10 CCA11 CCA12
## 0.439 0.292 0.163 0.142 0.118 0.089 0.070 0.058 0.031 0.013 0.008 0.007
## CCA13 CCA14
## 0.006 0.005
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9
## 0.1978 0.1419 0.1012 0.0708 0.0533 0.0333 0.0189 0.0151 0.0095
m0
## Call: cca(formula = varespec ~ 1, data = varechem)
##
## Inertia Rank
## Total 2.08
## Unconstrained 2.08 23
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.525 0.357 0.234 0.195 0.178 0.122 0.115 0.089
## (Showed only 8 of all 23 unconstrained eigenvalues)
Then we can use the step function to select the "best" model. The step function uses Akaike's Information Criterion (AIC) for model choice. The AIC rewards goodness of fit (high constrained inertia) but is penalized by the number of estimated parameters (the constrained rank). The alternative models are ordered by AIC. In each table, "+" indicates the effect of adding a term and "-" the effect of removing one, while the current model is marked as <none>:
m <- step(m0, scope=formula(m1), test="perm")
## Start: AIC=130
## varespec ~ 1
##
## Df AIC F Pr(>F)
## + Al 1 129 3.67 0.005 **
## + Mn 1 129 3.31 0.010 **
## + Humdepth 1 129 3.01 0.010 **
## + Baresoil 1 130 2.46 0.020 *
## + Fe 1 130 2.44 0.025 *
## + P 1 130 2.19 0.035 *
## + Zn 1 130 1.93 0.065 .
## <none> 130
## + Mg 1 130 1.87 0.035 *
## + K 1 130 1.86 0.085 .
## + Ca 1 130 1.80 0.075 .
## + pH 1 131 1.66 0.105
## + S 1 131 1.51 0.100 .
## + N 1 131 1.46 0.135
## + Mo 1 131 1.06 0.425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=129
## varespec ~ Al
##
## Df AIC F Pr(>F)
## + P 1 128 2.50 0.010 **
## + K 1 128 2.32 0.010 **
## + S 1 128 2.16 0.015 *
## + Zn 1 128 1.99 0.020 *
## + Mn 1 128 1.89 0.030 *
## <none> 129
## + Mg 1 129 1.74 0.055 .
## + N 1 129 1.59 0.115
## + Baresoil 1 129 1.57 0.095 .
## + Ca 1 129 1.42 0.170
## + Humdepth 1 129 1.38 0.210
## + Mo 1 130 0.99 0.380
## + pH 1 130 0.88 0.590
## + Fe 1 130 0.52 0.880
## - Al 1 130 3.67 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=128
## varespec ~ Al + P
##
## Df AIC F Pr(>F)
## + K 1 127 2.17 0.035 *
## <none> 128
## + Baresoil 1 128 1.66 0.040 *
## + N 1 128 1.55 0.085 .
## + S 1 128 1.34 0.230
## + Mn 1 128 1.26 0.215
## + Zn 1 128 1.20 0.255
## + Humdepth 1 129 1.15 0.340
## - P 1 129 2.50 0.015 *
## + Mo 1 129 0.98 0.415
## + Mg 1 129 0.96 0.450
## + pH 1 129 0.92 0.540
## + Fe 1 129 0.53 0.895
## + Ca 1 129 0.46 0.900
## - Al 1 130 3.94 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=127
## varespec ~ Al + P + K
##
## Df AIC F Pr(>F)
## <none> 127
## + N 1 128 1.51 0.120
## + Baresoil 1 128 1.45 0.140
## + Zn 1 128 1.31 0.230
## + S 1 128 1.26 0.305
## - K 1 128 2.17 0.035 *
## + Mo 1 128 1.24 0.310
## - P 1 128 2.34 0.015 *
## + Mg 1 128 1.03 0.415
## + Mn 1 128 0.89 0.530
## + Humdepth 1 128 0.81 0.565
## + Fe 1 129 0.52 0.895
## + pH 1 129 0.51 0.905
## + Ca 1 129 0.44 0.930
## - Al 1 130 4.33 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
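step() is a generic function, so the same AIC-guided search can be tried on any model with a formula. A minimal base-R sketch with an ordinary linear model on the built-in mtcars data (not the vegan data):

```r
# Stepwise search by AIC between a null and a full model,
# exactly as done above for cca, but for lm on mtcars.
m.full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
m.null <- lm(mpg ~ 1, data = mtcars)
m.sel  <- step(m.null, scope = formula(m.full), trace = 0)
formula(m.sel)
```

Setting trace = 0 suppresses the step-by-step tables shown above; the selected model always has an AIC at least as good as the starting point.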
With continuous variables, model building often works rather well, but it should not be trusted blindly (if at all). You can see this if you use rda instead of cca, or the dune and dune.env data sets instead of the lichen pastures (you may try). Moreover, you may end up with a different model if you change the modelling strategy. The following simplifies the maximum model:
mback <- step(m1, test="perm")
## Start: AIC=130
## varespec ~ N + P + K + Ca + Mg + S + Al + Fe + Mn + Zn + Mo +
## Baresoil + Humdepth + pH
##
## Df AIC F Pr(>F)
## - Fe 1 130 0.67 0.760
## <none> 130
## - Ca 1 130 0.87 0.540
## - N 1 130 0.88 0.570
## - Zn 1 130 0.90 0.550
## - pH 1 130 0.95 0.505
## - Al 1 130 0.96 0.560
## - Mo 1 131 1.09 0.390
## - Mn 1 131 1.14 0.340
## - Mg 1 131 1.24 0.235
## - Baresoil 1 131 1.26 0.260
## - P 1 132 1.45 0.190
## - K 1 132 1.46 0.175
## - S 1 132 1.46 0.155
## - Humdepth 1 133 1.90 0.055 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=130
## varespec ~ N + P + K + Ca + Mg + S + Al + Mn + Zn + Mo + Baresoil +
## Humdepth + pH
##
## Df AIC F Pr(>F)
## - Al 1 129 0.64 0.835
## - pH 1 129 0.69 0.725
## - N 1 130 0.76 0.755
## - Zn 1 130 0.78 0.685
## <none> 130
## - Ca 1 130 0.91 0.490
## - Mn 1 130 0.98 0.495
## - Mg 1 131 1.26 0.240
## - Mo 1 131 1.30 0.265
## - Baresoil 1 131 1.36 0.230
## - S 1 131 1.43 0.150
## - K 1 131 1.59 0.110
## - P 1 131 1.60 0.145
## - Humdepth 1 132 1.72 0.075 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=129
## varespec ~ N + P + K + Ca + Mg + S + Mn + Zn + Mo + Baresoil +
## Humdepth + pH
##
## Df AIC F Pr(>F)
## - N 1 129 0.73 0.745
## <none> 129
## - Ca 1 129 0.96 0.450
## - Mn 1 129 1.00 0.480
## - Zn 1 130 1.06 0.390
## - pH 1 130 1.06 0.370
## - Mg 1 130 1.32 0.175
## - Baresoil 1 130 1.41 0.160
## - K 1 130 1.58 0.090 .
## - P 1 131 1.62 0.110
## - Mo 1 131 1.70 0.105
## - Humdepth 1 131 1.82 0.055 .
## - S 1 131 2.01 0.020 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=129
## varespec ~ P + K + Ca + Mg + S + Mn + Zn + Mo + Baresoil + Humdepth +
## pH
##
## Df AIC F Pr(>F)
## - pH 1 128 0.89 0.505
## - Ca 1 129 1.03 0.405
## <none> 129
## - Zn 1 129 1.06 0.460
## - Mn 1 129 1.10 0.350
## - Baresoil 1 130 1.42 0.155
## - Mg 1 130 1.54 0.120
## - P 1 130 1.66 0.145
## - Humdepth 1 130 1.74 0.095 .
## - Mo 1 130 1.85 0.085 .
## - K 1 131 2.12 0.040 *
## - S 1 132 2.68 0.010 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=129
## varespec ~ P + K + Ca + Mg + S + Mn + Zn + Mo + Baresoil + Humdepth
##
## Df AIC F Pr(>F)
## - Ca 1 128 0.90 0.585
## - Zn 1 128 1.09 0.370
## - Mn 1 128 1.10 0.345
## <none> 128
## - Mg 1 129 1.51 0.105
## - Baresoil 1 129 1.62 0.130
## - P 1 129 1.69 0.100 .
## - Mo 1 130 1.70 0.065 .
## - Humdepth 1 130 1.86 0.065 .
## - K 1 130 2.20 0.010 **
## - S 1 131 2.68 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=128
## varespec ~ P + K + Mg + S + Mn + Zn + Mo + Baresoil + Humdepth
##
## Df AIC F Pr(>F)
## - Zn 1 128 1.07 0.465
## - Mg 1 128 1.18 0.330
## - Mn 1 128 1.20 0.305
## <none> 128
## - Baresoil 1 129 1.61 0.130
## - P 1 129 1.80 0.100 .
## - Humdepth 1 129 1.84 0.045 *
## - Mo 1 129 2.03 0.025 *
## - K 1 129 2.06 0.025 *
## - S 1 130 2.26 0.020 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=128
## varespec ~ P + K + Mg + S + Mn + Mo + Baresoil + Humdepth
##
## Df AIC F Pr(>F)
## <none> 128
## - Mn 1 128 1.45 0.155
## - Baresoil 1 128 1.60 0.100 .
## - P 1 129 1.83 0.065 .
## - Mg 1 129 1.92 0.045 *
## - Humdepth 1 129 1.97 0.030 *
## - K 1 129 2.02 0.050 *
## - S 1 129 2.13 0.020 *
## - Mo 1 129 2.31 0.025 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare the order in which variables were added in the first model with the order in which they were removed in this one.
Constrained ordination methods really do not have an AIC, although one is calculated in step. Function ordistep instead uses a permutation-based significance test in place of the non-existent AIC. The results of random permutations really are random, and therefore the model choice can have a random component. The usage is almost identical to the AIC-based step:
m <- ordistep(m0, scope = formula(m1))
##
## Start: varespec ~ 1
##
## Df AIC F Pr(>F)
## + Al 1 129 3.67 0.005 **
## + Mn 1 129 3.31 0.005 **
## + Humdepth 1 129 3.01 0.005 **
## + Baresoil 1 130 2.46 0.010 **
## + Fe 1 130 2.44 0.015 *
## + P 1 130 2.19 0.035 *
## + Zn 1 130 1.93 0.050 *
## + Ca 1 130 1.80 0.055 .
## + Mg 1 130 1.87 0.060 .
## + K 1 130 1.86 0.060 .
## + S 1 131 1.51 0.095 .
## + pH 1 131 1.66 0.120
## + N 1 131 1.46 0.190
## + Mo 1 131 1.06 0.390
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: varespec ~ Al
##
## Df AIC F Pr(>F)
## - Al 1 130 3.67 0.01 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Df AIC F Pr(>F)
## + P 1 128 2.50 0.010 **
## + K 1 128 2.32 0.025 *
## + S 1 128 2.16 0.025 *
## + Zn 1 128 1.99 0.025 *
## + Mn 1 128 1.89 0.060 .
## + N 1 129 1.59 0.080 .
## + Mg 1 129 1.74 0.085 .
## + Baresoil 1 129 1.57 0.100 .
## + Ca 1 129 1.42 0.155
## + Humdepth 1 129 1.38 0.170
## + Mo 1 130 0.99 0.395
## + pH 1 130 0.88 0.590
## + Fe 1 130 0.52 0.910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: varespec ~ Al + P
##
## Df AIC F Pr(>F)
## - P 1 129 2.50 0.015 *
## - Al 1 130 3.94 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Df AIC F Pr(>F)
## + K 1 127 2.17 0.040 *
## + N 1 128 1.55 0.075 .
## + Baresoil 1 128 1.66 0.100 .
## + S 1 128 1.34 0.185
## + Mn 1 128 1.26 0.230
## + Zn 1 128 1.20 0.275
## + Humdepth 1 129 1.15 0.315
## + pH 1 129 0.92 0.460
## + Mg 1 129 0.96 0.485
## + Mo 1 129 0.98 0.490
## + Fe 1 129 0.53 0.895
## + Ca 1 129 0.46 0.915
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: varespec ~ Al + P + K
##
## Df AIC F Pr(>F)
## - K 1 128 2.17 0.025 *
## - P 1 128 2.34 0.005 **
## - Al 1 130 4.33 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Df AIC F Pr(>F)
## + Baresoil 1 128 1.45 0.15
## + N 1 128 1.51 0.16
## + Mo 1 128 1.24 0.20
## + Zn 1 128 1.31 0.23
## + S 1 128 1.26 0.27
## + Mg 1 128 1.03 0.40
## + Humdepth 1 128 0.81 0.57
## + Mn 1 128 0.89 0.58
## + Ca 1 129 0.44 0.85
## + pH 1 129 0.51 0.87
## + Fe 1 129 0.52 0.92
m$anova
One problem with model building is that the constraining variables are not independent but correlated: any correlated variable can be explained by the other variables. Such variables are redundant ("expendable") in the company of the others, yet they may be the best variables alone and prevent other variables from entering the model. A statistic describing this is the variance inflation factor (VIF), which is 1 for completely independent variables; values above 10 or 20 (depending on your taste) are regarded as highly multicollinear (dependent on the others). The VIF of a variable depends on the set it sits with:
vif.cca(m1)
## N P K Ca Mg S Al Fe
## 1.98 6.03 12.01 9.93 9.81 18.38 21.19 9.13
## Mn Zn Mo Baresoil Humdepth pH
## 5.38 7.74 4.32 2.25 6.01 7.39
vif.cca(m)
## Al P K
## 1.01 2.37 2.38
vif.cca(mback)
## P K Mg S Mn Mo Baresoil Humdepth
## 3.93 7.63 2.86 9.25 3.47 2.45 1.68 2.84
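The definition of the VIF can be checked by hand: regress each predictor on all the others and take 1/(1 - R2). A base-R sketch on the built-in mtcars data (not the lichen data):

```r
# VIF of a predictor = 1 / (1 - R^2) from regressing it on the
# remaining predictors; exactly 1 for completely independent variables.
vif_by_hand <- function(data, var) {
  others <- setdiff(names(data), var)
  fit <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(fit)$r.squared)
}
preds <- mtcars[, c("wt", "hp", "disp")]
sapply(names(preds), vif_by_hand, data = preds)
```

Since R2 is never negative, the VIF is always at least 1; disp, which is strongly correlated with both wt and hp, gets the largest value here.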
You can build models manually if the automatic procedure fails. For instance, it does not work well for the Dutch dune data, where the building stops too early:
m0 <- cca(dune ~ 1, dune.env)
m0
## Call: cca(formula = dune ~ 1, data = dune.env)
##
## Inertia Rank
## Total 2.12
## Unconstrained 2.12 19
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8
## 0.536 0.400 0.260 0.176 0.145 0.108 0.092 0.081
## (Showed only 8 of all 19 unconstrained eigenvalues)
m1 <- cca(dune ~ ., dune.env)
m1
## Call: cca(formula = dune ~ A1 + Moisture + Management + Use +
## Manure, data = dune.env)
##
## Inertia Proportion Rank
## Total 2.115 1.000
## Constrained 1.503 0.711 12
## Unconstrained 0.612 0.289 7
## Inertia is mean squared contingency coefficient
## Some constraints were aliased because they were collinear (redundant)
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8 CCA9 CCA10 CCA11 CCA12
## 0.467 0.341 0.176 0.153 0.095 0.070 0.059 0.050 0.032 0.026 0.023 0.011
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7
## 0.2724 0.1088 0.0897 0.0631 0.0349 0.0253 0.0180
m <- step(m0, scope=formula(m1), test="p")
## Start: AIC=87.7
## dune ~ 1
##
## Df AIC F Pr(>F)
## + Moisture 3 86.6 2.25 0.005 **
## + Management 3 86.9 2.13 0.005 **
## + A1 1 87.4 2.14 0.030 *
## <none> 87.7
## + Manure 4 88.8 1.53 0.025 *
## + Use 2 89.1 1.14 0.190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=86.6
## dune ~ Moisture
##
## Df AIC F Pr(>F)
## <none> 86.6
## + Management 3 86.8 1.46 0.050 *
## + A1 1 87.0 1.26 0.175
## + Use 2 87.3 1.28 0.100 .
## + Manure 4 87.3 1.31 0.085 .
## - Moisture 3 87.7 2.25 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m
## Call: cca(formula = dune ~ Moisture, data = dune.env)
##
## Inertia Proportion Rank
## Total 2.115 1.000
## Constrained 0.628 0.297 3
## Unconstrained 1.487 0.703 16
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3
## 0.419 0.133 0.077
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9 CA10 CA11 CA12
## 0.410 0.226 0.176 0.123 0.108 0.091 0.086 0.061 0.057 0.047 0.042 0.020
## CA13 CA14 CA15 CA16
## 0.014 0.010 0.009 0.008
You can use the update command to change the formula. The retained parts of the formula are denoted by a dot (.), and terms are added with + or removed with -. The following keeps the dependent variable (the left-hand side) and all previously entered constraints (hence ". ~ ."), and adds Management to the previous model:
m <- update(m, . ~ . + Management)
m
## Call: cca(formula = dune ~ Moisture + Management, data = dune.env)
##
## Inertia Proportion Rank
## Total 2.115 1.000
## Constrained 1.002 0.474 6
## Unconstrained 1.113 0.526 13
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
## 0.446 0.289 0.112 0.072 0.049 0.034
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9 CA10 CA11 CA12
## 0.350 0.152 0.125 0.110 0.092 0.077 0.059 0.048 0.037 0.022 0.021 0.011
## CA13
## 0.008
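update() works the same way for any model with a formula; a minimal base-R sketch with lm:

```r
# '.' stands for the retained parts of the formula on each side.
fit0 <- lm(mpg ~ wt, data = mtcars)
fit1 <- update(fit0, . ~ . + hp)   # add hp:    mpg ~ wt + hp
fit2 <- update(fit1, . ~ . - wt)   # remove wt: mpg ~ hp
formula(fit2)
```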
You can check whether any other terms should be added to this model, or whether removing an included term would improve it. Because constraints are correlated, the significance of a term can change when variables are added to or removed from the model:
add1(m, scope=formula(m1), test="perm")
drop1(m, test="perm")
If needed, you can manually continue model building.
We already saw significance tests during model building. These tests were based on permutation: there is no known distribution of inertia that could be used for strict statistical testing. We simply permute the rows of the community data, repeat the analysis, and get a random result. If our observed result is better than most of the random models (say, better than 95% of them), we call the result significant. Package vegan has several alternative types of significance tests, all performed with a function called anova. The name is somewhat misleading: the tests are based on permutations, although the layout of the results is similar to a standard ANOVA table. The default is an overall test of all variables together:
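The permutation idea itself can be sketched in a few lines of base R, here using the R2 of an ordinary regression on mtcars as the test statistic instead of inertia:

```r
# Permute the rows of the response, refit, and see how often the
# random statistic is at least as good as the observed one.
set.seed(42)
obs <- summary(lm(mpg ~ wt, data = mtcars))$r.squared
perm <- replicate(999, {
  y <- sample(mtcars$mpg)   # permutation destroys the association
  summary(lm(y ~ mtcars$wt))$r.squared
})
(sum(perm >= obs) + 1) / (999 + 1)   # permutation P-value
</imports>
```

Adding 1 to both counts treats the observed result as one of the permutations, so the P-value can never be exactly zero.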
anova(m)
We actually used the function anova.cca, but you do not need to give its name in full: R automatically chooses the correct anova variant for the result of a constrained ordination. Functions that automatically adapt to the type of result are called method functions; we have already seen many of them (for instance, plot and summary work differently for different types of results). The anova.cca function tries to be clever and lazy: it stops automatically if the observed permutation significance clearly differs from the targeted critical value (P = 0.05 by default), but it will continue for a long time in uncertain cases. You must set step and perm.max to the same value to override this behaviour. It is also possible to analyse terms separately:
anova(m, by="term", permu=200)
In this case, the function cannot select the number of iterations automatically. This test is sequential: the terms are analysed in the order in which they appear in the model.
Moreover, it is possible to analyse significance of each axis:
anova(m, by="axis", perm=500)
Now the automatic selection works, but typically some of your axes will be very close to the critical value, and it may be useful to set a lower perm.max than the default 10000 (typically you use higher limits than in these examples: we used lower limits to save time when this document is automatically generated).
All constrained ordination methods can have terms that are partialled out from the analysis before constraints:
m <- cca(dune ~ Management + Condition(Moisture + A1), data=dune.env)
m
## Call: cca(formula = dune ~ Management + Condition(Moisture + A1),
## data = dune.env)
##
## Inertia Proportion Rank
## Total 2.115 1.000
## Conditional 0.744 0.352 4
## Constrained 0.395 0.187 3
## Unconstrained 0.976 0.461 12
## Inertia is mean squared contingency coefficient
##
## Eigenvalues for constrained axes:
## CCA1 CCA2 CCA3
## 0.2385 0.0865 0.0704
##
## Eigenvalues for unconstrained axes:
## CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9 CA10
## 0.3064 0.1319 0.1152 0.1095 0.0772 0.0758 0.0487 0.0376 0.0311 0.0210
## CA11 CA12
## 0.0125 0.0093
This partials out the effect of Moisture and A1 before analysing the effects of Management. This also influences the significances of the terms:
anova(m, by="term", perm=500)
If we had a designed experiment, we might wish to restrict the permutations so that observations are permuted only within the levels of strata:
with(dune.env, anova(m, by="term", perm=500, strata=Moisture))
Here with() is a special function that makes the variables of dune.env visible to the following command. If you simply typed Moisture at the R prompt, you would get a missing-variable error. Functions that take a formula have a data argument giving the data frame in which the variables are found, but other functions usually lack such an argument. Instead of with(dune.env, command()), you can first attach(dune.env), after which all variables in the data frame are visible in the session. This can be dangerous if you have objects with the same names in your session or in several attached data frames: it becomes difficult to know which one was actually used.
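A minimal base-R illustration of the difference, with a throw-away data frame:

```r
# with() makes the columns visible only for one expression;
# attach() puts them on the search path until detach().
df <- data.frame(a = 1:3, b = c(10, 20, 30))
with(df, mean(a + b))   # 22; a and b visible only inside the call
attach(df)
m <- mean(a + b)        # now visible everywhere in the session
detach(df)              # tidy up to avoid name clashes
m                       # 22
```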
The basic partial ordination decomposes the inertia into a conditional (partialled-out) component and a (residual) constrained component. The decomposition is sequential: conditions are partialled out before the constraints are analysed, so reversing the order of terms changes the components. Often we want a neutral decomposition of the variation into the unique and shared components of several sources. For this we can use the vegan function varpart, which can handle up to four sources of variation. Function varpart partitions adjusted R2. Unlike the ordinary R2 or variance, the adjusted R2 is unbiased, with an expected value of zero for random data. However, it can easily be calculated only for RDA; in principle CCA could also be used, but the calculations are lengthy and complicated, and no software implements them.
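The adjustment itself is simple; a base-R check with an ordinary regression (n observations, p predictors):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1):
# penalized for the number of predictors, zero in expectation
# for random data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
p   <- length(coef(fit)) - 1        # predictors, excluding intercept
r2  <- summary(fit)$r.squared
adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
all.equal(adj, summary(fit)$adj.r.squared)   # TRUE
```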
The example above can be analysed using command:
mod <- varpart(dune, ~ Management, ~ A1 + Moisture, data = dune.env)
mod
##
## Partition of variance in RDA
##
## Call: varpart(Y = dune, X = ~Management, ~A1 + Moisture, data =
## dune.env)
##
## Explanatory tables:
## X1: ~Management
## X2: ~A1 + Moisture
##
## No. of explanatory tables: 2
## Total variation (SS): 1598.4
## Variance: 84.124
## No. of observations: 20
##
## Partition table:
## Df R.squared Adj.R.squared Testable
## [a+b] = X1 3 0.34747 0.22512 TRUE
## [b+c] = X2 4 0.35382 0.18150 TRUE
## [a+b+c] = X1+X2 7 0.58105 0.33666 TRUE
## Individual fractions
## [a] = X1|X2 3 0.15515 TRUE
## [b] 0 0.06997 FALSE
## [c] = X2|X1 4 0.11153 TRUE
## [d] = Residuals 0.66334 FALSE
## ---
## Use function 'rda' to test significance of fractions of interest
The first argument (dune) gives the dependent data, and the next two to four arguments give the sources of variation; these can be single variables, matrices, or one-sided model formulae as above.
The output can be tricky to read: the names X1, X2, etc. refer to the original sources. These are divided into unique and shared components. In our example, source X1 (Management) has two components: a unique component a and a component b that is shared with source X2. Source X2 consists of its unique component c and the same shared component b. Consequently, X1 = a + b and X2 = b + c. The unique component is the variation that can be explained by that source alone and not by the other sources; it is shown with the operator | in the output: a = X1 | X2. The lower-case letters can be deciphered using:
showvarparts(2)
showvarparts(3)
and the decomposition can be displayed in the same diagrams with
plot(mod)
The decomposition can be done by hand:
ab <- rda(dune ~ Management, dune.env)
bc <- rda(dune ~ A1 + Moisture, dune.env)
abc <- rda(dune ~ Management + A1 + Moisture, dune.env)
a <- rda(dune ~ Management + Condition(A1 + Moisture), dune.env)
c <- rda(dune ~ A1 + Moisture + Condition(Management), dune.env)
However, the shared component b cannot be found with a direct model. All of the models above are testable, but component b is non-testable and can only be found indirectly as a difference between models. The components of variation can be negative for various reasons: the simplest is random variation in the adjusted R2, but there can be other, more complicated causes (such as multivariate non-linear dependence between the sources of variation).
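The indirect calculation can be sketched with the adjusted R2 of three ordinary regressions on mtcars, standing in for the three RDA fits above:

```r
# b = [a+b] + [b+c] - [a+b+c]: the shared fraction is the overlap
# of the two sources, obtainable only as a difference of models.
adjR2 <- function(f) summary(lm(f, data = mtcars))$adj.r.squared
ab  <- adjR2(mpg ~ wt)        # fraction [a+b]: source X1 alone
bc  <- adjR2(mpg ~ hp)        # fraction [b+c]: source X2 alone
abc <- adjR2(mpg ~ wt + hp)   # fraction [a+b+c]: both sources
c(a = abc - bc, b = ab + bc - abc, c = abc - ab)
```

Because wt and hp are correlated, the shared fraction b comes out clearly positive here; with uncorrelated sources it would be near zero, and sampling noise in the adjusted R2 can push it below zero.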